MINIMAX SPARSE PRINCIPAL SUBSPACE ESTIMATION
IN HIGH DIMENSIONS

By Vincent Q. Vu and Jing Lei
The Ohio State University and Carnegie Mellon University 

We study sparse principal components analysis in high dimen- 
sions, where p (the number of variables) can be much larger than n 
(the number of observations) , and analyze the problem of estimating 
the subspace spanned by the principal eigenvectors of the popula- 
tion covariance matrix. We prove optimal, non-asymptotic lower and 
upper bounds on the minimax subspace estimation error under two 
different, but related notions of $\ell_q$ subspace sparsity for $0 \le q \le 1$.
Our upper bounds apply to general classes of covariance matrices,
and they show that $\ell_q$ constrained estimates can achieve optimal
minimax rates without restrictive spiked covariance conditions. 

1. Introduction. Principal components analysis (PCA) was introduced 
in the early 20th century (Pearson, 1901; Hotelling, 1933) and is arguably 
the most well known and widely used technique for dimension reduction. 
It is part of the mainstream statistical repertoire and is routinely used in 
numerous and diverse areas of application. However, contemporary appli- 
cations often involve much higher-dimensional data than envisioned by the 
early developers of PCA. In such high-dimensional situations, where the 
number of variables p is of the same order or much larger than the number 
of observations n, serious difficulties emerge: standard PCA can produce 
inconsistent estimates of the principal directions of variation and lead to 
unreliable conclusions (Johnstone and Lu, 2009; Paul, 2007; Nadler, 2008). 

The principal directions of variation correspond to the eigenvectors of
the covariance matrix, and in high dimensions consistent estimation of the
eigenvectors is generally not possible without additional assumptions about
the covariance matrix or its eigenstructure. Much of the recent development
in PCA has focused on methodology that applies the concept of sparsity to
eigenvector estimation (some examples include Jolliffe, Trendafilov and Uddin,
2003; d'Aspremont et al., 2007; Zou, Hastie and Tibshirani, 2006; Shen and Huang,
2008; Witten, Tibshirani and Hastie, 2009). Theoretical developments on
sparsity and PCA include Johnstone and Lu (2009); Amini and Wainwright
(2009); Shen, Shen and Marron (2011); Ma (2011); Vu and Lei (2012a);
Birnbaum et al. (2012).

An open problem that has remained is whether sparse PCA methods can 
optimally estimate the subspace spanned by the leading eigenvectors, i.e. the 
principal subspace of variation. The subspace estimation problem is directly 
connected to dimension reduction and is important when there is more than 
one principal component of interest. Indeed, typical applications of PCA 
use the projection onto the principal subspace to facilitate exploration and
inference of important features of the data. In that case the assumption 
that there are distinct principal directions of variation is mathematically 
convenient but unnatural. 

In this paper we study principal subspace estimation by sparse PCA
in high dimensions. We present non-asymptotic minimax lower and upper
bounds with optimal dependence on the parameters of the problem. As an
illustration, one consequence of our results is that the order of the minimax
mean squared estimation error of the $d$-dimensional principal subspace is,
ignoring constant factors,

\[ R_q \left[ \frac{\sigma^2}{n} (d + \log p) \right]^{1 - q/2} , \qquad 0 \le q \le 1 , \]

where $\sigma^2$ is a measure of the noise-to-signal ratio and $R_q$ is a measure of
the sparsity in an $\ell_q$ sense defined in Section 2. The $d + \log p$ factor is novel
and it reflects two complementary aspects of the problem: $d$ for parametric
estimation and $\log p$ for variable selection.

We obtain the minimax upper bound by analyzing a sparsity constrained 
principal subspace estimator and showing that it attains the optimal error 
(up to a constant factor). In comparison to most existing analyses in the 
literature, we show that the upper bound holds without assuming a spiked 
covariance model. A key technical ingredient in our analysis of the subspace
estimator is a novel variational form of the Davis-Kahan $\sin\Theta$ theorem (see
Lemma 5.2) that allows us to bound the estimation error using some recent
advances in empirical process theory. The minimax lower bound follows the
standard Fano method framework, but involves nontrivial constructions of
packing sets in the Stiefel manifold.

Our results provide the first and optimal minimax lower bound for sparse 
principal subspace estimation. To our knowledge, the only other work that 
has considered sparse principal subspace estimation is that of Ma (2011) 
on the rate of convergence of an iterative thresholding estimator. However,
that analysis depends on assuming a spiked covariance model, and even then
the rate of convergence has suboptimal dependence on the dimension of the
principal subspace.

The remainder of the paper is organized as follows. In the next section,
we introduce the sparse principal subspace estimation problem and formally
set up our minimax framework and estimator. In Section 3 we present our
main conditions and results, and provide a brief discussion about their
consequences and intuition. Sections 4 and 5 contain the major steps in proving
the lower and upper bounds. The major steps in the proofs require some
auxiliary lemmas whose proofs we defer to Appendices A and B. Section 6
closes the paper with a discussion of our results and open problems.

2. Subspace estimation. Let $X_1, \ldots, X_n \in \mathbb{R}^p$ be independent,
identically distributed random vectors with mean $\mu$ and covariance matrix $\Sigma$. To
reduce the dimension of the $X_i$'s from $p$ down to $d$, PCA looks for $d$ mutually
uncorrelated, linear combinations of the $p$ coordinates of $X_i$ that have
maximal variance. Geometrically, this is equivalent to finding a $d$-dimensional
linear subspace that is closest to the centered random vector $X_i - \mu$ in a
mean squared sense$^1$, and it corresponds to the optimization problem

\[ \begin{array}{ll} \text{minimize} & \mathbb{E}\|(I_p - \Pi_{\mathcal{G}})(X_i - \mu)\|_2^2 \\ \text{subject to} & \mathcal{G} \in \mathbb{G}_{p,d} , \end{array} \tag{2.1} \]

where $\mathbb{G}_{p,d}$ is the Grassmann manifold$^2$ of $d$-dimensional subspaces of $\mathbb{R}^p$,
$\Pi_{\mathcal{G}}$ is the projection onto $\mathcal{G}$, and $I_p$ is the $p \times p$ identity matrix. There is
always at least one $d \le p$ for which eq. (2.1) has a unique solution. That
solution can be determined by the spectral decomposition

\[ \Sigma = \sum_{j=1}^{p} \lambda_j v_j v_j^T , \tag{2.2} \]

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the eigenvalues of $\Sigma$ and
$v_1, \ldots, v_p \in \mathbb{R}^p$, orthonormal, are the associated eigenvectors. If
$\lambda_d > \lambda_{d+1}$, then the $d$-dimensional principal subspace of $\Sigma$ is

\[ \mathcal{S} = \mathrm{span}\{v_1, \ldots, v_d\} , \tag{2.3} \]

and the projection onto $\mathcal{S}$ is given by $\Pi_{\mathcal{S}} = V V^T$, where $V$ is the $p \times d$
matrix with columns $v_1, \ldots, v_d$.

1 This is essentially the viewpoint of Pearson (1901).

2 For background on Grassmann and Stiefel manifolds, see Edelman, Arias and Smith
(1998) and Chikuse (2003).

In practice, $\Sigma$ is unknown, so $\mathcal{S}$ must be estimated from the data. Standard
PCA replaces eq. (2.1) with an empirical version. This leads to the spectral
decomposition of the sample covariance matrix

\[ S_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T , \]

where $\bar{X}$ is the sample mean, and estimating $\mathcal{S}$ by the span of the leading $d$
eigenvectors of $S_n$. In high dimensions, however, the eigenvectors of $S_n$ can
be inconsistent estimators of the eigenvectors of $\Sigma$. Additional structural
constraints are necessary for consistent estimation of $\mathcal{S}$.
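As an illustration of the standard PCA estimate just described, the following is a minimal NumPy sketch. The function name `pca_subspace` and the simulated data are assumptions made for this example only; the 1/n normalization of the sample covariance follows the display above.

```python
import numpy as np

def pca_subspace(X, d):
    """Return an orthonormal basis of the leading d-dimensional eigenspace of S_n."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)               # center by the sample mean
    S_n = Xc.T @ Xc / n                   # sample covariance with 1/n normalization
    evals, evecs = np.linalg.eigh(S_n)    # eigenvalues returned in ascending order
    V_hat = evecs[:, ::-1][:, :d]         # leading d eigenvectors as columns
    return V_hat, S_n

# Hypothetical data: n = 200 observations of p = 5 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) * np.array([3.0, 2.0, 1.0, 1.0, 1.0])
V_hat, S_n = pca_subspace(X, d=2)
Pi_hat = V_hat @ V_hat.T                  # projection onto the estimated subspace
```

The estimate of the principal subspace is the column span of `V_hat`; only the projection `Pi_hat` is determined by the data, since any rotation of the columns spans the same subspace.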

2.1. Subspace sparsity. The notion of sparsity is appealing and has been
used successfully in the context of estimating vector valued parameters such
as the leading eigenvector in PCA. Extending this notion to subspaces
requires care because sparsity is inherently a coordinate-dependent concept
while subspaces are coordinate-independent. For a given $d$-dimensional
subspace $\mathcal{G} \in \mathbb{G}_{p,d}$, the set of orthonormal matrices whose columns span $\mathcal{G}$ is a
subset of the Stiefel manifold $\mathbb{V}_{p,d}$ of $p \times d$ orthonormal matrices, and any two
of its elements are equal up to multiplication on the right by an orthogonal
matrix. We will consider two complementary notions of subspace sparsity
defined in terms of those orthonormal matrices: row sparsity and column sparsity.

Define the $(2,q)$-norm$^3$, $q \ge 0$, of a $p \times d$ matrix $A$ as

\[ \|A\|_{2,q}^q := \sum_{j=1}^{p} \Big[ \sum_{k=1}^{d} a_{jk}^2 \Big]^{q/2} = \sum_{j=1}^{p} \|a_{j*}\|_2^q \quad \text{if } q > 0, \text{ and} \qquad \|A\|_{2,0} := \sum_{j=1}^{p} 1_{\{\|a_{j*}\|_2 \ne 0\}} \quad \text{if } q = 0 , \]

where $a_{j*}$ denotes the $j$th row of $A$. Note that $\|\cdot\|_{2,q}$ is coordinate-independent,
because $\|AO\|_{2,q} = \|A\|_{2,q}$ for any orthogonal matrix $O \in \mathbb{R}^{d \times d}$. We define
the row sparse subspaces using this norm.

Definition (Row sparse subspaces). For $q \ge 0$ and $R_q \ge d$,

\[ \mathcal{M}_q(R_q) := \begin{cases} \{ \mathrm{span}(U) : U \in \mathbb{V}_{p,d} \text{ and } \|U\|_{2,q}^q \le R_q \} & \text{if } q > 0, \text{ and} \\ \{ \mathrm{span}(U) : U \in \mathbb{V}_{p,d} \text{ and } \|U\|_{2,0} \le R_0 \} & \text{if } q = 0 , \end{cases} \]

where $\mathrm{span}(U)$ denotes the span of the columns of $U$.

3 To be precise, this is actually a pseudonorm when $q < 1$.

Roughly speaking, row sparsity asserts that there is a small subset of
variables (coordinates of $\mathbb{R}^p$) that generate the principal subspace. Since
$\|\cdot\|_{2,q}$ is coordinate-independent, every orthonormal basis of a row sparse $\mathcal{G}$
has the same $(2,q)$-norm. Column sparsity, on the other hand, asserts that
there is some orthonormal basis of sparse vectors that spans the principal
subspace. Define the $(*,q)$-norm, $q \ge 0$, of a $p \times d$ matrix $A$ as

\[ \|A\|_{*,q}^q := \max_{1 \le k \le d} \sum_{j=1}^{p} |a_{jk}|^q \quad \text{if } q > 0, \text{ and} \qquad \|A\|_{*,0} := \max_{1 \le k \le d} \sum_{j=1}^{p} 1_{\{a_{jk} \ne 0\}} \quad \text{if } q = 0 . \]

This is the maximum of the $\ell_q$ norms of the columns of $A$ and is not
coordinate-independent. We define the column sparse subspaces to be those
that have some orthonormal basis with small $(*,q)$-norm.

Definition (Column sparse subspaces). For $q \ge 0$ and $R_q \ge 1$,

\[ \mathcal{M}_q^*(R_q) := \begin{cases} \{ \mathrm{span}(U) : U \in \mathbb{V}_{p,d} \text{ and } \|U\|_{*,q}^q \le R_q \} & \text{if } q > 0, \text{ and} \\ \{ \mathrm{span}(U) : U \in \mathbb{V}_{p,d} \text{ and } \|U\|_{*,0} \le R_0 \} & \text{if } q = 0 . \end{cases} \]

The column sparse subspaces are the $d$-dimensional subspaces that have
some orthonormal basis whose vectors are $\ell_q$ sparse in the usual sense. Unlike
row sparsity, the orthonormal bases of a column sparse $\mathcal{G}$ do not all have
the same $(*,q)$-norm, but if $\mathcal{G} \in \mathcal{M}_q^*(R_q)$, then there exists some $U \in \mathbb{V}_{p,d}$
such that $\mathcal{G} = \mathrm{span}(U)$ and $\|U\|_{*,q}^q \le R_q$ (or $\|U\|_{*,0} \le R_0$ for $q = 0$).
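The contrast between the two notions can be checked numerically. The sketch below is illustrative only: the helper names `norm_2q` and `norm_star_q`, the numerical tolerance, and the 6 x 2 example basis are assumptions introduced here. It evaluates the row-wise quantity (the sum of q-th powers of row norms) and the column-wise quantity (the largest sum of q-th powers of absolute entries in a column) before and after rotating a basis.

```python
import numpy as np

def norm_2q(A, q):
    """Sum of q-th powers of the row norms for q > 0; number of nonzero rows for q = 0."""
    row_norms = np.linalg.norm(A, axis=1)
    if q == 0:
        return float(np.count_nonzero(row_norms > 1e-12))
    return float(np.sum(row_norms ** q))

def norm_star_q(A, q):
    """Largest column-wise sum of |a_jk|^q for q > 0; largest column support for q = 0."""
    if q == 0:
        return float(np.max(np.count_nonzero(np.abs(A) > 1e-12, axis=0)))
    return float(np.max(np.sum(np.abs(A) ** q, axis=0)))

# Orthonormal basis with two 1-sparse columns (p = 6, d = 2).
U = np.zeros((6, 2))
U[0, 0] = 1.0
U[1, 1] = 1.0

# Rotate the basis by 45 degrees; the spanned subspace does not change.
c = np.sqrt(0.5)
O = np.array([[c, -c], [c, c]])
UO = U @ O

row_before, row_after = norm_2q(U, 1), norm_2q(UO, 1)          # both equal 2
col_before, col_after = norm_star_q(U, 1), norm_star_q(UO, 1)  # 1 versus sqrt(2)
```

The row-wise value is unchanged by the rotation, while the column-wise value grows, which is why column sparsity is stated in terms of the existence of some sparse basis.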

2.2. Parameter space. We assume that there exist i.i.d. random vectors
$Z_1, \ldots, Z_n \in \mathbb{R}^p$, with $\mathbb{E} Z_i = 0$ and $\mathrm{Var}(Z_i) = I_p$, such that

\[ X_i = \mu + \Sigma^{1/2} Z_i \quad \text{and} \quad \|Z_i\|_{\psi_2} \le 1 , \tag{2.4} \]

for $i = 1, \ldots, n$, where $\|\cdot\|_{\psi_\alpha}$ is the Orlicz $\psi_\alpha$-norm$^4$ defined for $\alpha \ge 1$ as

\[ \|Z\|_{\psi_\alpha} := \sup_{b : \|b\|_2 \le 1} \inf \left\{ C > 0 : \mathbb{E} \exp\left( \left| \frac{\langle Z, b \rangle}{C} \right|^{\alpha} \right) \le 2 \right\} . \]

This ensures that the distribution of the $X_i$'s is sub-Gaussian. We also
assume that the eigengap $\lambda_d - \lambda_{d+1} > 0$ so that the principal subspace $\mathcal{S}$ is
well-defined. Intuitively, $\mathcal{S}$ is harder to estimate when the eigengap is small.
This is made precise by the noise-to-signal ratio

\[ \sigma^2 := \frac{\lambda_1 \lambda_{d+1}}{(\lambda_d - \lambda_{d+1})^2} . \tag{2.5} \]

4 See van der Vaart and Wellner (1996, Chapter 2) for more information on the Orlicz
$\psi_\alpha$-norm.

It turns out that $\sigma^2$ is a key quantity in the estimation of $\mathcal{S}$, and that it is
analogous to the noise variance in linear regression. Let

\[ \mathcal{P}_q(\sigma^2, R_q) \]

denote the class of distributions on $X_1, \ldots, X_n$ that satisfy eq. (2.4), eq. (2.5),
and $\mathcal{S} \in \mathcal{M}_q(R_q)$. Similarly, let

\[ \mathcal{P}_q^*(\sigma^2, R_q) \]

denote the class of distributions that satisfy eq. (2.4), eq. (2.5), and
$\mathcal{S} \in \mathcal{M}_q^*(R_q)$. Throughout this paper, we consider estimating $\mathcal{S}$ over
$\mathcal{P}_q(\sigma^2, R_q)$ and $\mathcal{P}_q^*(\sigma^2, R_q)$.
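To make the role of the eigengap concrete, here is a small sketch that evaluates the noise-to-signal ratio, taken here to be the product of the largest and (d+1)-th eigenvalues divided by the squared gap between the d-th and (d+1)-th eigenvalues. The function name `noise_to_signal` and the example spectra are assumptions for illustration.

```python
def noise_to_signal(eigenvalues, d):
    """sigma^2 = lambda_1 * lambda_{d+1} / (lambda_d - lambda_{d+1})^2."""
    lam = sorted(eigenvalues, reverse=True)   # lambda_1 >= lambda_2 >= ...
    gap = lam[d - 1] - lam[d]                 # eigengap lambda_d - lambda_{d+1}
    if gap <= 0:
        raise ValueError("the eigengap must be positive")
    return lam[0] * lam[d] / gap ** 2

wide_gap = noise_to_signal([5.0, 4.0, 1.0, 1.0], d=2)    # gap 3 gives 5/9
narrow_gap = noise_to_signal([5.0, 2.0, 1.0, 1.0], d=2)  # gap 1 gives 5
```

Shrinking the gap from 3 to 1 while keeping the other eigenvalues fixed increases the ratio ninefold, matching the intuition that a small eigengap makes the subspace harder to estimate.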

2.3. Subspace distance. A notion of distance between subspaces is
necessary to measure the performance of a principal subspace estimator. The
canonical angles between subspaces generalize the notion of angles between
lines and can be used to define subspace distances. There are several
equivalent ways to describe canonical angles, but for our purposes it will be easiest
to describe them in terms of projection matrices.$^5$ For a subspace $\mathcal{E} \in \mathbb{G}_{p,d}$
and its orthogonal projection $E$, we write $E^\perp$ to denote the orthogonal
projection onto $\mathcal{E}^\perp$ and recall that $E^\perp = I_p - E$.

Definition. Let $\mathcal{E}$ and $\mathcal{F}$ be $d$-dimensional subspaces of $\mathbb{R}^p$ with
orthogonal projections $E$ and $F$. Denote the singular values of $E F^\perp$ by
$s_1 \ge s_2 \ge \cdots$. The canonical angles between $\mathcal{E}$ and $\mathcal{F}$ are the numbers

\[ \theta_k(\mathcal{E}, \mathcal{F}) = \arcsin(s_k) \]

for $k = 1, \ldots, d$ and the angle operator between $\mathcal{E}$ and $\mathcal{F}$ is the $d \times d$ matrix

\[ \Theta(\mathcal{E}, \mathcal{F}) = \mathrm{diag}(\theta_1, \ldots, \theta_d) . \]

In this paper we will consider the following distance between subspaces
$\mathcal{E}, \mathcal{F} \in \mathbb{G}_{p,d}$:

\[ \|\sin\Theta(\mathcal{E}, \mathcal{F})\|_F , \]

where $\|\cdot\|_F$ is the Frobenius norm. This distance is indeed a metric on $\mathbb{G}_{p,d}$
(see Stewart and Sun, 1990, for example), and can be connected to the
familiar Frobenius (squared error) distance between projection matrices by
the following well-known fact from matrix perturbation theory.

5 We refer the reader to Bhatia (1997, Chapter VII.1) and Stewart and Sun (1990) for
additional background on canonical angles.

Proposition 2.1 (see Stewart and Sun (1990), Theorem 1.5.5). Let $\mathcal{E}$
and $\mathcal{F}$ be $d$-dimensional subspaces of $\mathbb{R}^p$ with orthogonal projections $E$ and
$F$. Then

1. The singular values of $E F^\perp$ are

\[ s_1, s_2, \ldots, s_d, 0, \ldots, 0 . \]

2. The singular values of $E - F$ are

\[ s_1, s_1, s_2, s_2, \ldots, s_d, s_d, 0, \ldots, 0 . \]

In other words, $E F^\perp$ has at most $d$ nonzero singular values and the nonzero
singular values of $E - F$ are the nonzero singular values of $E F^\perp$, each
counted twice.

Thus,

\[ \|\sin\Theta(\mathcal{E}, \mathcal{F})\|_F^2 = \|E F^\perp\|_F^2 = \frac{1}{2} \|E - F\|_F^2 = \|E^\perp F\|_F^2 . \tag{2.6} \]

We will frequently use these identities. For simplicity, we will overload
notation and write

\[ \sin(U_1, U_2) := \sin\Theta(\mathrm{span}(U_1), \mathrm{span}(U_2)) \]

for $U_1, U_2 \in \mathbb{V}_{p,d}$. We also use a similar convention for $\sin(E, F)$, where $E, F$
are the orthogonal projections corresponding to $\mathcal{E}, \mathcal{F} \in \mathbb{G}_{p,d}$. The following
proposition, proved in the Appendix, relates the subspace distance to the
ordinary Euclidean distance between orthonormal matrices.

Proposition 2.2. If $V_1, V_2 \in \mathbb{V}_{p,d}$, then

\[ \frac{1}{2} \inf_{Q \in \mathbb{V}_{d,d}} \|V_1 - V_2 Q\|_F^2 \le \|\sin(V_1, V_2)\|_F^2 \le \inf_{Q \in \mathbb{V}_{d,d}} \|V_1 - V_2 Q\|_F^2 . \]

In other words, the distance between two subspaces is equivalent to the
distance between their orthonormal bases, up to some rotation.
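The identities and the two-sided comparison above can be checked on random subspaces. The following sketch is a numerical illustration, not part of the proofs: the function names are assumptions, and the infimum over orthogonal matrices is computed with the standard orthogonal Procrustes solution obtained from a singular value decomposition.

```python
import numpy as np

def sin_theta_sq(V1, V2):
    """Squared Frobenius sin-theta distance, computed as ||E F^perp||_F^2."""
    p = V1.shape[0]
    E, F = V1 @ V1.T, V2 @ V2.T
    return float(np.linalg.norm(E @ (np.eye(p) - F), "fro") ** 2)

def procrustes_sq(V1, V2):
    """min over orthogonal Q of ||V1 - V2 Q||_F^2 via the SVD of V2^T V1."""
    W, _, Zt = np.linalg.svd(V2.T @ V1)
    Q = W @ Zt                                    # optimal rotation
    return float(np.linalg.norm(V1 - V2 @ Q, "fro") ** 2)

rng = np.random.default_rng(2)
V1, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # random orthonormal 6 x 2 bases
V2, _ = np.linalg.qr(rng.standard_normal((6, 2)))

dist_sq = sin_theta_sq(V1, V2)
half_proj_sq = 0.5 * float(np.linalg.norm(V1 @ V1.T - V2 @ V2.T, "fro") ** 2)
proc_sq = procrustes_sq(V1, V2)
```

Here `dist_sq` agrees with `half_proj_sq`, and it lies between `0.5 * proc_sq` and `proc_sq`, which is the content of the two propositions in this example.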

2.4. Sparse subspace estimators. Here we introduce two estimators that
achieve the optimal (up to a constant factor) minimax error for sparse
subspace estimation. To estimate a sparse subspace, it is natural to consider the
empirical minimization problem corresponding to eq. (2.1) with an additional
sparsity constraint corresponding to either $\mathcal{M}_q(R_q)$ or $\mathcal{M}_q^*(R_q)$.

We define the row sparse principal subspace estimator to be a solution of
the following constrained optimization problem.

\[ \begin{array}{ll} \text{minimize} & \dfrac{1}{n} \sum_{i=1}^{n} \|(I_p - \Pi_{\mathcal{G}})(X_i - \bar{X})\|_2^2 \\ \text{subject to} & \mathcal{G} \in \mathcal{M}_q(R_q) . \end{array} \tag{2.7} \]

For our analysis it is more convenient to work on the Stiefel manifold.
Let $\langle A, B \rangle := \mathrm{Tr}(A^T B)$ for matrices $A, B$ of compatible dimension. It is
straightforward to show that the following optimization problem is equivalent
to eq. (2.7).

\[ \begin{array}{ll} \text{maximize} & \langle S_n, U U^T \rangle \\ \text{subject to} & U \in \mathbb{V}_{p,d} \text{ and } \|U\|_{2,q}^q \le R_q \ (\text{or } \|U\|_{2,0} \le R_0 \text{ if } q = 0) . \end{array} \tag{2.8} \]

If $\hat{V}$ is a solution of eq. (2.8), then $\mathrm{span}(\hat{V})$ is a solution of eq. (2.7). The
feasible set of both problems is nonempty when $R_q \ge d$, and the sparsity
constraint is active only when $R_q < d^{q/2} p^{1 - q/2}$. When $q = 1$, the estimator
defined by eq. (2.8) is essentially a generalization to subspaces of the
Lasso-type sparse PCA estimator proposed by Jolliffe, Trendafilov and Uddin (2003).
A similar idea has also been used by Chen, Zou and Cook (2010) in the
context of sufficient dimension reduction. This estimator appears to be
computationally intractable, because it involves a convex maximization problem.
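For very small problems with q = 0, the row sparse estimator can be computed by exhaustive search, which also makes the combinatorial cost visible. The sketch below is an illustrative brute-force procedure, not an algorithm proposed in the paper; the function name and the diagonal example matrix are assumptions. It uses the fact that, once the set of nonzero rows is fixed, the trace objective is maximized by the leading eigenvectors of the corresponding principal submatrix, and that enlarging the support cannot decrease that maximum, so supports of size exactly R_0 suffice.

```python
import itertools
import numpy as np

def row_sparse_pca_bruteforce(S_n, d, R0):
    """Maximize <S_n, U U^T> over orthonormal U with at most R0 nonzero rows (q = 0)."""
    p = S_n.shape[0]
    best_val, best_U = -np.inf, None
    for support in itertools.combinations(range(p), R0):   # C(p, R0) candidate supports
        idx = np.array(support)
        evals, evecs = np.linalg.eigh(S_n[np.ix_(idx, idx)])
        val = float(evals[-d:].sum())       # best objective value on this support
        if val > best_val:
            U = np.zeros((p, d))
            U[idx, :] = evecs[:, -d:]       # embed the leading eigenvectors
            best_val, best_U = val, U
    return best_U, best_val

# Hypothetical covariance whose leading 2-dimensional eigenspace uses two variables.
S = np.diag([5.0, 4.0, 0.1, 0.1, 0.1, 0.1])
U_hat, value = row_sparse_pca_bruteforce(S, d=2, R0=3)
```

The loop visits every support of size R_0, so the cost grows combinatorially in p, which is consistent with the remark that the estimator does not appear to be computationally tractable in general.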

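In the column sparse case, we define the column sparse principal subspace
estimator analogously to the row sparse principal subspace estimator, using
the column sparse subspaces $\mathcal{M}_q^*(R_q)$ instead of the row sparse ones. This
leads to the following equivalent Grassmann and Stiefel manifold
optimization problems.

\[ \begin{array}{ll} \text{minimize} & \dfrac{1}{n} \sum_{i=1}^{n} \|(I_p - \Pi_{\mathcal{G}})(X_i - \bar{X})\|_2^2 \\ \text{subject to} & \mathcal{G} \in \mathcal{M}_q^*(R_q) . \end{array} \tag{2.9} \]

\[ \begin{array}{ll} \text{maximize} & \langle S_n, U U^T \rangle \\ \text{subject to} & U \in \mathbb{V}_{p,d} \text{ and } \|U\|_{*,q}^q \le R_q \ (\text{or } \|U\|_{*,0} \le R_0 \text{ if } q = 0) . \end{array} \tag{2.10} \]

Like the row sparse estimator, the column sparse principal subspace
estimator does not appear to be computationally tractable either.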

3. Main results. In this section we present our main results on the
minimax lower and upper bounds of sparse principal subspace estimation
over the row sparse and column sparse classes.

3.1. Minimax lower bounds. To highlight the key results with minimal
assumptions, we will first consider the simplest case where $q = 0$. Consider
the following two conditions.

Condition 1. There is a constant $M > 0$ such that

\[ (R_q - d) \left[ \frac{\sigma^2}{n} \left( d + \log \frac{(p - d)^{1 - q/2}}{R_q - d} \right) \right]^{1 - q/2} \le M . \]

Condition 2. $4 \le p - d$ and $2d \le R_q - d \le (p - d)^{1 - q/2}$.

Condition 1 is necessary for the existence of a consistent estimator (see
Theorems 4.1 and 4.2). Without Condition 1, the statements of our results
would be complicated by multiple cases to deal with the fact that the
subspace distance is bounded above by $\sqrt{d}$. The lower bounds on $p - d$ and $R_q - d$
are minor technical conditions that ensure our non-asymptotic bounds are
non-trivial. Similarly, the upper bound on $R_q - d$ is only violated in trivial
cases.

Theorem 3.1 (Row sparse lower bound, $q = 0$). If Conditions 1 and 2
hold, then

\[ \inf_{\hat{\mathcal{S}}} \sup_{\mathcal{P}_0(\sigma^2, R_0)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \ge c (R_0 - d) \frac{\sigma^2}{n} \left[ d + \log \frac{p - d}{R_0 - d} \right] . \]

Here, as well as in the entire paper, $c$ denotes a universal, positive
constant, not necessarily the same at each occurrence. This lower bound result
reflects two separate aspects of the estimation problem: variable selection
and parameter estimation after variable selection. Variable selection refers
to finding the variables that generate the principal subspace, while
estimation refers to estimating the subspace after selecting the variables. For each
variable, we accumulate two types of errors: one proportional to $d$ that
reflects the coordinates of the variable in the $d$-dimensional subspace, and one
proportional to $\log[(p - d)/(R_0 - d)]$ that reflects the cost of searching for
the $R_0$ active variables. We prove Theorem 3.1 in Section 4.
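To see how the two error terms trade off, the rate in Theorem 3.1, namely (R_0 - d)(sigma^2/n)[d + log((p - d)/(R_0 - d))], can be evaluated directly. The sketch below omits the universal constant c, so it gives the order of the bound only; the function name and the example values are assumptions for illustration.

```python
import math

def row_sparse_lower_rate_q0(n, p, d, R0, sigma2):
    """(R0 - d) * (sigma2 / n) * [d + log((p - d) / (R0 - d))], without the constant c."""
    # Condition 2 specialized to q = 0: 4 <= p - d and 2d <= R0 - d <= p - d.
    if not (4 <= p - d and 2 * d <= R0 - d <= p - d):
        raise ValueError("Condition 2 (with q = 0) is violated")
    estimation = d                                   # parameter estimation term
    selection = math.log((p - d) / (R0 - d))         # variable selection term
    return (R0 - d) * sigma2 / n * (estimation + selection)

rate = row_sparse_lower_rate_q0(n=100, p=1004, d=4, R0=14, sigma2=1.0)
```

In this example the selection term log(100) is slightly larger than the estimation term d = 4, so both aspects contribute at the same order.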

The non-asymptotic lower bound for $0 < q < 2$ has a more complicated
dependence on $(n, p, d, R_q, \sigma^2)$ because of the interaction between $\ell_q$ and
$\ell_2$ norms. Therefore, our main lower bound result for $0 < q < 2$ will focus
on values of $(n, p, d, R_q, \sigma^2)$ that correspond to the high-dimensional and
sparse regime. (We will state more general lower bound results in Section 4.)
Let

\[ T := \frac{R_q - d}{(p - d)^{1 - q/2}} \quad \text{and} \quad \gamma := \frac{(p - d) \sigma^2}{n} . \tag{3.1} \]

The interpretation for these two quantities is natural. First, $T$ measures
the relative sparsity of the problem. It ranges between 0 and 1, though the
"sparse" regime generally corresponds to $T \ll 1$. The second quantity, $\gamma$,
corresponds to the classic mean squared error (MSE) of standard PCA. The
problem is low-dimensional if $\gamma$ is small and there is not much sparsity. We
impose the following condition to preclude this case.

Condition 3. There is a constant $a < 1$ such that $T^a \le \gamma^{q/2}$.

This condition lower bounds the classic MSE in terms of the sparsity
and is mild in high-dimensional situations. When $a = q/2$, for example,
Condition 3 reduces to

\[ R_q - d \le \frac{\sigma^2}{n} (p - d)^{2 - q/2} . \]

We also note that this assumption becomes milder for larger values of $a$ and
it is related to conditions in other minimax inference problems involving $\ell_p$
and $\ell_q$ balls (see Donoho and Johnstone, 1994, for example).

Theorem 3.2 (Row sparse lower bound, $0 < q < 2$). Let $q \in (0, 2)$. If
Conditions 1 to 3 hold, then

\[ \inf_{\hat{\mathcal{S}}} \sup_{\mathcal{P}_q(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \ge c (R_q - d) \left\{ \frac{\sigma^2}{n} \left[ d + \log \frac{(p - d)^{1 - q/2}}{R_q - d} \right] \right\}^{1 - q/2} . \]

This result generalizes Theorem 3.1 and reflects the same combination
of variable selection and parameter estimation. When Condition 3 does not
hold, the problem is outside of the sparse, high-dimensional regime. As we
show in the proof, there is actually a "phase transition regime" between
the high-dimensional sparse and the classic dense regimes for which the sharp
minimax rate remains unknown. A similar phenomenon has been observed
in Birnbaum et al. (2012).

By modifying the proofs of Theorem 3.1 and Theorem 3.2 we can obtain
results for the column sparse case that are parallel to the row sparse case.
For brevity we present the $q = 0$ and $q > 0$ cases together. The analog of $T$
for the column sparse case is

\[ T_* := \frac{d (R_q - 1)}{(p - d)^{1 - q/2}} , \tag{3.2} \]

and the analogs of Conditions 2 and 3 are the following.

Condition 4. $4d \le p - d$ and $d \le d(R_q - 1) \le (p - d)^{1 - q/2}$.

Condition 5. There is a constant $a < 1$ such that $T_*^a \le \gamma^{q/2}$.
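The quantities entering these conditions are simple to compute. The following sketch evaluates T = (R_q - d)/(p - d)^{1 - q/2}, gamma = (p - d) sigma^2 / n and T_* = d(R_q - 1)/(p - d)^{1 - q/2}, and checks the inequality T^a <= gamma^{q/2} for a given a < 1. The function names and example values are assumptions made for illustration.

```python
def regime_quantities(n, p, d, Rq, sigma2, q):
    """Return (T, gamma, T_star) for the row sparse and column sparse settings."""
    scale = (p - d) ** (1 - q / 2)
    T = (Rq - d) / scale                 # relative row sparsity
    gamma = (p - d) * sigma2 / n         # classic MSE of standard PCA
    T_star = d * (Rq - 1) / scale        # column sparse analog of T
    return T, gamma, T_star

def sparse_regime_condition(T, gamma, q, a):
    """Check T^a <= gamma^(q/2) for a constant a < 1 (form of Conditions 3 and 5)."""
    return a < 1 and T ** a <= gamma ** (q / 2)

T, gamma, T_star = regime_quantities(n=100, p=404, d=4, Rq=14, sigma2=1.0, q=1.0)
holds = sparse_regime_condition(T, gamma, q=1.0, a=0.5)
```

With these values T = 0.5 and gamma = 4, so the row sparse condition holds with a = 1/2; the same check applied to `T_star` gives the column sparse version.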

Theorem 3.3 (Column sparse lower bound). Let $q \in [0, 2)$. If Conditions 4
and 5 hold, then

\[ \inf_{\hat{\mathcal{S}}} \sup_{\mathcal{P}_q^*(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \ge c \, d (R_q - 1) \left\{ \frac{\sigma^2}{n} \left[ 1 + \log \frac{(p - d)^{1 - q/2}}{d (R_q - 1)} \right] \right\}^{1 - q/2} . \]

For column sparse subspaces, the lower bound is dominated by the
variable selection error, because column sparsity is defined in terms of the
maximal $\ell_q$ norms of the vectors in an orthonormal basis and $R_q$ variables must
be selected for each of the $d$ vectors. So the variable selection error is inflated
by a factor of $d$. We prove Theorem 3.3 in Section 4.

3.2. Minimax upper bounds. Our upper bound results are obtained by
analyzing the estimators given in Section 2.4. The case where $q = 0$ is the
clearest, and we begin by stating a weaker, but simpler minimax upper
bound for the row sparse class.

Theorem 3.4 (Row sparse upper bound in expectation). Let $\hat{\mathcal{S}}$ be any
solution of eq. (2.7) with $q = 0$. If $6 \sqrt{R_0 (d + \log p)} \le \sqrt{n}$, then

\[ \sup_{\mathcal{P}_0(\sigma^2, R_0)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \le c \sqrt{R_0} \left[ \frac{\lambda_1}{\lambda_{d+1}} \cdot \frac{\sigma^2 (d + \log p)}{n} \right]^{1/2} . \]

Although eq. (2.7) may not have a unique global minimum, Theorem 3.4
shows that any global minimum will be within a certain radius of the
principal subspace $\mathcal{S}$. The proof of Theorem 3.4, given in Section 5.2, is relatively
simple but still nontrivial. It also serves as a prototype for the much more
involved proof of our main upper bound result stated in Theorem 3.5 below.
We note that the rate given by Theorem 3.4 is off by a $\sqrt{\lambda_1 / \lambda_{d+1}}$ factor
that is due to the specific approach taken to control an empirical process
in our proof of Theorem 3.4.

To state the main upper bound result with optimal dependence on
$(n, p, d, R_q, \sigma^2)$, we first describe some regularity conditions. Let

\[ \epsilon_n := \sqrt{2} \, R_q^{1/2} \left( \frac{d + \log p}{n} \right)^{\frac{1}{2} - \frac{q}{4}} . \]

imsart-aos ver. 2012/08/31 file: minimax-sparsesubspace.tex date: November 5, 2012 




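The regularity conditions are

\[ \epsilon_n \le 1 , \tag{3.3} \]

\[ c_1 \sqrt{\frac{d}{n}} \log n \, \lambda_1 + c_3 \epsilon_n (\log n)^{5/2} \lambda_{d+1} < \frac{1}{2} (\lambda_d - \lambda_{d+1}) , \tag{3.4} \]

\[ c_3 \epsilon_n (\log n)^{5/2} \lambda_{d+1} \le \sqrt{\lambda_1 \lambda_{d+1}}^{\,1 - q/2} (\lambda_d - \lambda_{d+1})^{q/2} , \text{ and} \tag{3.5} \]

\[ c_3 \epsilon_n^2 (\log n)^{5/2} \lambda_{d+1} \le \sqrt{\lambda_1 \lambda_{d+1}}^{\,2 - q} (\lambda_d - \lambda_{d+1})^{-(1 - q)} , \tag{3.6} \]

where $c_1$ and $c_3$ are given positive constants.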

Theorem 3.5 (Row sparse upper bound in probability). Let $q \in [0, 1]$
and $\hat{\mathcal{S}}$ be any solution of eq. (2.7). If $(X_1, \ldots, X_n) \sim \mathbb{P} \in \mathcal{P}_q(\sigma^2, R_q)$ and
eqs. (3.3) to (3.6) hold, then

\[ \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \le c R_q \left( \frac{\sigma^2 (d + \log p)}{n} \right)^{1 - q/2} \]

with probability at least $1 - 4/(n - 1) - 6 \log n / n - p^{-1}$.

Theorem 3.5 is presented in terms of a probability bound instead of an
expectation bound. This stems from technical aspects of our proof that
involve bounding the supremum of an empirical process over a set of random
diameter. The upper bound matches our lower bounds (Theorem 3.1 and
Theorem 3.2) for the entire tuple $(n, p, d, R_q, \sigma^2)$ up to a constant if

\[ R_q^{2/(2 - q)} \le p^c \]

for some constant $c < 1$. The proof of Theorem 3.5 is in Section 5.2. By
observing that $\mathcal{M}_q^*(R_q) \subseteq \mathcal{M}_q(d R_q)$, we can reuse the proof of Theorem 3.5
to derive the following upper bound for the column sparse class.

Corollary 3.1 (Column sparse upper bound). Let $q \in [0, 1]$ and $\hat{\mathcal{S}}$ be
any solution of eq. (2.9). If $(X_1, \ldots, X_n) \sim \mathbb{P} \in \mathcal{P}_q^*(\sigma^2, R_q)$ and eqs. (3.3)
to (3.6) hold with $R_q$ replaced by $d R_q$, then

\[ \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \le c \, d R_q \left( \frac{\sigma^2 (d + \log p)}{n} \right)^{1 - q/2} \]

with probability at least $1 - 4/(n - 1) - 6 \log n / n - p^{-1}$.

Corollary 3.1 is slightly weaker than the corresponding result for the row
sparse class. It matches the lower bound in Theorem 3.3 up to a constant if

\[ \big( d (R_q - 1) \big)^{2/(2 - q)} \le p^c \]

for some constant $c < 1$, and $d \le C \log p$ for some other constant $C$.






4. Lower bound proofs. Theorems 3.1 to 3.3 are consequences of 
three more general results stated below. An essential part of the strategy 
of our proof is to analyze the variable selection and estimation aspects of 
the problem separately. We will consider two types of subsets of the param- 
eter space that capture the essential difficulty of each aspect: one where the 
subspaces vary over different subsets of variables, and another where the 
subspaces vary over a fixed subset of variables. The first two results give 
lower bounds for each aspect in the row sparse case. Theorems 3.1 and 3.2 
follow easily from them. The third result directly addresses the proof of 
Theorem 3.3. 

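Theorem 4.1 (Row sparse variable selection). Let $q \in [0, 2)$ and $(p, d, R_q)$
satisfy

\[ 4 \le p - d \quad \text{and} \quad 1 \le R_q - d \le (p - d)^{1 - q/2} . \]

There exists a universal constant $c > 0$ such that every estimator $\hat{\mathcal{S}}$ satisfies
the following. If $T < \gamma^{q/2}$, then

\[ \sup_{\mathcal{P}_q(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ (R_q - d) \left[ \frac{\sigma^2}{n} \left( 1 - \log \big( T / \gamma^{q/2} \big) \right) \right]^{1 - q/2} \wedge 1 \right\}^{1/2} . \tag{4.1} \]

Otherwise,

\[ \sup_{\mathcal{P}_q(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ \frac{(p - d) \sigma^2}{n} \wedge 1 \right\}^{1/2} . \tag{4.2} \]

The case $q = 0$ is particularly simple, because $T < \gamma^{q/2} = 1$ holds trivially.
In that case, Theorem 4.1 asserts that

\[ \sup_{\mathcal{P}_0(\sigma^2, R_0)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ (R_0 - d) \frac{\sigma^2}{n} \left( 1 + \log \frac{p - d}{R_0 - d} \right) \wedge 1 \right\}^{1/2} . \tag{4.3} \]

When $q \in (0, 2)$ the transition between the $T < \gamma^{q/2}$ and $T \ge \gamma^{q/2}$ regimes
involves lower order (log log) terms that can be seen in eq. (4.15). Under
Condition 3, eq. (4.1) can be simplified to

\[ \sup_{\mathcal{P}_q(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ (R_q - d) \left[ \frac{\sigma^2}{n} \left( 1 + (1 - a) \log \frac{(p - d)^{1 - q/2}}{R_q - d} \right) \right]^{1 - q/2} \wedge 1 \right\}^{1/2} . \tag{4.4} \]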

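Theorem 4.2 (Row sparse parameter estimation). Let $q \in [0, 2)$ and
$(p, d, R_q)$ satisfy

\[ 2 \le d \quad \text{and} \quad 2d \le R_q - d \le (p - d)^{1 - q/2} , \]

and let $T$ and $\gamma$ be defined as in eq. (3.1). There exists a universal constant
$c > 0$ such that every estimator $\hat{\mathcal{S}}$ satisfies the following. If $T < (d\gamma)^{q/2}$, then

\[ \sup_{\mathcal{P}_q(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ (R_q - d) \left( \frac{d \sigma^2}{n} \right)^{1 - q/2} \wedge d \right\}^{1/2} . \tag{4.5} \]

Otherwise,

\[ \sup_{\mathcal{P}_q(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ \frac{d (p - d) \sigma^2}{n} \wedge d \right\}^{1/2} . \tag{4.6} \]

This result with eq. (4.3) implies Theorem 3.1, and with eq. (4.4)
it implies Theorem 3.2.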

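Theorem 4.3 (Column sparse estimation). Let $q \in [0, 2)$ and $(p, d, R_q)$
satisfy

\[ 4 \le (p - d)/d \quad \text{and} \quad d \le d (R_q - 1) \le (p - d)^{1 - q/2} , \]

and recall the definition of $T_*$ in eq. (3.2). There exists a universal constant
$c > 0$ such that every estimator $\hat{\mathcal{S}}$ satisfies the following. If $T_* < \gamma^{q/2}$, then

\[ \sup_{\mathcal{P}_q^*(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ d (R_q - 1) \left[ \frac{\sigma^2}{n} \left( 1 - \log \big( T_* / \gamma^{q/2} \big) \right) \right]^{1 - q/2} \wedge d \right\}^{1/2} . \tag{4.7} \]

Otherwise,

\[ \sup_{\mathcal{P}_q^*(\sigma^2, R_q)} \mathbb{E} \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c \left\{ \frac{(p - d) \sigma^2}{n} \wedge d \right\}^{1/2} . \tag{4.8} \]

In the next section we set up a general technique, using Fano's
inequality and Stiefel manifold embeddings, for obtaining minimax lower bounds
in principal subspace estimation problems. Then we move on to proving
Theorems 4.1 to 4.3.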

4.1. Lower bounds for principal subspace estimation via Fano's method.
Our main tool for proving minimax lower bounds is the generalized Fano
method. We quote the following version from Yu (1997, Lemma 3).

Lemma 4.1 (Generalized Fano method). Let $N \ge 1$ be an integer and
$\{\theta_1, \ldots, \theta_N\} \subset \Theta$ index a collection of probability measures $\mathbb{P}_{\theta_i}$ on a
measurable space $(\mathcal{X}, \mathcal{A})$. Let $d$ be a pseudometric on $\Theta$ and suppose that for all
$i \ne j$,

\[ d(\theta_i, \theta_j) \ge \alpha_N \]

and the Kullback-Leibler (KL) divergence

\[ D(\mathbb{P}_{\theta_i} \| \mathbb{P}_{\theta_j}) \le \beta_N . \]

Then every $\mathcal{A}$-measurable estimator $\hat{\theta}$ satisfies

\[ \max_i \mathbb{E}_{\theta_i} d(\hat{\theta}, \theta_i) \ge \frac{\alpha_N}{2} \left[ 1 - \frac{\beta_N + \log 2}{\log N} \right] . \]

The calculations required for applying Lemma 4.1 are tractable when
$\{\mathbb{P}_{\theta_i}\}$ is a collection of multivariate Normal distributions. Let $A \in \mathbb{V}_{p,d}$
and consider the mean zero $p$-variate Normal distribution with covariance
matrix

\[ \Sigma(A) = b A A^T + I_p = (1 + b) A A^T + (I_p - A A^T) , \tag{4.9} \]

where $b > 0$. The noise-to-signal ratio of the principal $d$-dimensional
subspace of these covariance matrices is

\[ \sigma^2 = \frac{1 + b}{b^2} . \]

For any prescribed value of $\sigma^2$ we can choose $b$ so that $(1 + b)/b^2 = \sigma^2$. The KL
divergence between these multivariate Normal distributions has a simple, exact
expression given in the following lemma. The proof is straightforward and
contained in the appendix.

Lemma 4.2 (KL divergence). For $i = 1, 2$, let $A_i \in \mathbb{V}_{p,d}$, $b > 0$,

\[ \Sigma(A_i) = (1 + b) A_i A_i^T + (I_p - A_i A_i^T) , \]

and $\mathbb{P}_i$ be the $n$-fold product of the $\mathcal{N}(0, \Sigma(A_i))$ probability measure. Then

\[ D(\mathbb{P}_1 \| \mathbb{P}_2) = \frac{n b^2}{1 + b} \|\sin(A_1, A_2)\|_F^2 . \]

The KL divergence between the probability measures in Lemma 4.2 is
equivalent to the subspace distance. In applying Lemma 4.1, we will need to
find packing sets in $\mathbb{V}_{p,d}$ that satisfy the sparsity constraints of the model and
have small diameter according to the subspace Frobenius distance. The next
lemma, proved in the appendix, provides a general method for constructing
such local packing sets.

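Lemma 4.3 (Local Stiefel embedding). Let $1 \le k \le d < p$ and the
function $A_\epsilon : \mathbb{V}_{p-d,k} \to \mathbb{V}_{p,d}$ be defined in block form as

\[ A_\epsilon(J) = \begin{bmatrix} (1 - \epsilon^2)^{1/2} I_k & 0 \\ 0 & I_{d-k} \\ \epsilon J & 0 \end{bmatrix} \tag{4.10} \]

for $0 \le \epsilon \le 1$. If $J_1, J_2 \in \mathbb{V}_{p-d,k}$, then

\[ \epsilon^2 (1 - \epsilon^2) \|J_1 - J_2\|_F^2 \le \|\sin(A_\epsilon(J_1), A_\epsilon(J_2))\|_F^2 \le \epsilon^2 \|J_1 - J_2\|_F^2 . \]

This lemma allows us to convert global $O(1)$-separated packing sets in
$\mathbb{V}_{p-d,k}$ into $O(\epsilon)$-separated packing sets in $\mathbb{V}_{p,d}$ that are localized within a
$O(\epsilon)$-diameter. Note that

\[ \|J_i - J_j\|_F \le \|J_i\|_F + \|J_j\|_F \le 2 \sqrt{k} . \]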
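The embedding and the covariance construction of eq. (4.9) can be checked numerically. The sketch below is an illustration under stated assumptions: the block layout of the embedding (a scaled identity block, an identity block of size d - k, and the block eps * J placed under the first k columns) is the form assumed here, and the function names and random test matrices are introduced only for this example.

```python
import math
import numpy as np

def local_stiefel_embedding(J, d, eps):
    """Map a (p-d) x k orthonormal J to a p x d orthonormal matrix A_eps(J)."""
    m, k = J.shape
    A = np.zeros((m + d, d))
    A[:k, :k] = math.sqrt(1.0 - eps ** 2) * np.eye(k)   # top-left block
    A[k:d, k:] = np.eye(d - k)                          # middle block I_{d-k}
    A[d:, :k] = eps * J                                 # bottom-left block eps * J
    return A

def sin_theta_sq(A1, A2):
    """||sin Theta||_F^2 = d - ||A1^T A2||_F^2 for orthonormal A1, A2."""
    return float(A1.shape[1] - np.linalg.norm(A1.T @ A2, "fro") ** 2)

def b_for_sigma2(sigma2):
    """Positive root of sigma2 * b^2 - b - 1 = 0, so that (1 + b) / b^2 = sigma2."""
    return (1.0 + math.sqrt(1.0 + 4.0 * sigma2)) / (2.0 * sigma2)

rng = np.random.default_rng(0)
p, d, k, eps = 9, 3, 2, 0.3
J1, _ = np.linalg.qr(rng.standard_normal((p - d, k)))
J2, _ = np.linalg.qr(rng.standard_normal((p - d, k)))
A1 = local_stiefel_embedding(J1, d, eps)
A2 = local_stiefel_embedding(J2, d, eps)

gap_sq = float(np.linalg.norm(J1 - J2, "fro") ** 2)
dist_sq = sin_theta_sq(A1, A2)

b = b_for_sigma2(2.0)                        # sigma^2 = 2 corresponds to b = 1
Sigma = b * (A1 @ A1.T) + np.eye(p)          # covariance of the form in eq. (4.9)
eigs = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
```

In this example `dist_sq` lies between `eps**2 * (1 - eps**2) * gap_sq` and `eps**2 * gap_sq`, and `Sigma` has d eigenvalues equal to 1 + b followed by p - d eigenvalues equal to 1, so its noise-to-signal ratio is (1 + b)/b^2.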

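By using Lemma 4.3 in conjunction with Lemmas 4.1 and 4.2, we have the
following generic method for lower bounding the minimax risk of estimating
the principal subspace of a covariance matrix.

Lemma 4.4. Let $\epsilon \in [0, 1]$ and $\{J_1, \ldots, J_N\} \subseteq \mathbb{V}_{p-d,k}$ for $1 \le k \le d < p$.
For each $i = 1, \ldots, N$, let $\mathbb{P}_i$ be the $n$-fold product of the $\mathcal{N}(0, \Sigma(A_\epsilon(J_i)))$
probability measure, where $\Sigma(\cdot)$ is defined in eq. (4.9) and $A_\epsilon(\cdot)$ is defined
in eq. (4.10). If

\[ \min_{i \ne j} \|J_i - J_j\|_F \ge \delta_N , \]

then every estimator $\hat{\mathcal{A}}$ of $\mathcal{A}_i := \mathrm{span}(A_\epsilon(J_i))$ satisfies

\[ \max_i \mathbb{E}_i \|\sin\Theta(\hat{\mathcal{A}}, \mathcal{A}_i)\|_F \ge \frac{\delta_N \epsilon \sqrt{1 - \epsilon^2}}{2} \left[ 1 - \frac{4 n k \epsilon^2 / \sigma^2 + \log 2}{\log N} \right] , \]

where $\sigma^2 = (1 + b)/b^2$.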

4.2. Proofs of the main lower bounds. 



Proof of Theorem 4.1. The following lemma, derived from Massart
(2007, Lemma 4.10), allows us to analyze the variable selection aspect.

Lemma 4.5 (Hypercube construction). Let $m$ be an integer satisfying
$e \le m$ and let $s \in [1, m]$. There exists a subset $\{J_1, \ldots, J_N\} \subseteq \mathbb{V}_{m,1}$
satisfying the following properties:

1. $\|J_i\|_{2,0} \le s$ for all $i$,

2. $\|J_i - J_j\|_2^2 \ge 1/4$ for all $i \ne j$, and

3. $\log N \ge \max\{ c s [1 + \log(m/s)] , \log m \}$, where $c > 1/30$ is an absolute
constant.

Proposition 4.1. If $J \in \mathbb{V}_{m,d}$ and $q \in (0, 2]$, then $\|J\|_{2,q}^q \le d^{q/2} \|J\|_{2,0}^{1 - q/2}$.

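Let $\rho \in (0, 1]$ and $\{J_1, \ldots, J_N\} \subseteq \mathbb{V}_{m,1}$ be the subset given by Lemma 4.5
with $m = p - d$ and $s = \max\{1, (p - d)\rho\}$. Then

\begin{align*}
\log N &\ge \max\{ c s (1 + \log[(p - d)/s]) , \log(p - d) \} \\
&\ge \max\{ (1/30)(p - d)\rho(1 - \log\rho) , \log(p - d) \} .
\end{align*}

Applying Lemma 4.4, with $k = 1$, $\delta_N = 1/2$, and $b$ chosen so that $(1+b)/b^2 = \sigma^2$, yields

\begin{align*}
\max_i E_i \|\sin\Theta(\hat{\mathcal{A}}, \mathcal{A}_i)\|_F
&\ge \frac{\epsilon}{4\sqrt{2}} \left[ 1 - \frac{4n\epsilon^2/\sigma^2}{(1/30)(p-d)\rho(1-\log\rho)} - \frac{\log 2}{\log(p-d)} \right] \\
&= \frac{\epsilon}{4\sqrt{2}} \left[ 1 - \frac{120\epsilon^2}{\gamma\rho(1-\log\rho)} - \frac{\log 2}{\log(p-d)} \right] \\
&\ge \frac{\epsilon}{4\sqrt{2}} \left[ \frac{1}{2} - \frac{120\epsilon^2}{\gamma\rho(1-\log\rho)} \right] \tag{4.11}
\end{align*}

for every estimator $\hat{\mathcal{A}}$ and all $\epsilon \in [0, 1/\sqrt{2}]$, because $p - d \ge 4$ by assumption.
Since $J_i \in \mathbb{V}_{p-d,1}$, Proposition 4.1 implies

\[
\tag{4.12}
\|A_\epsilon(J_i)\|_{2,q}^q \le
\begin{cases}
d + s , & \text{if } q = 0 , \text{ and} \\
d + \epsilon^q s^{\frac{2-q}{2}} , & \text{if } 0 < q < 2 .
\end{cases}
\]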



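For every $q \in [0, 2)$,

\[
d + \epsilon^q s^{\frac{2-q}{2}} \le R_q \iff \epsilon^{2q} \le \frac{(R_q - d)^2}{s^{2-q}} = \frac{(R_q - d)^2}{\max\{1, (p - d)\rho\}^{2-q}} .
\]

Thus, eq. (4.12) implies that the constraint

\[ \tag{4.13} \epsilon^{2q} \le \min\{ (T/\rho)^2 \rho^q , (R_q - d)^2 \} \]

is sufficient for $\mathcal{A}_i \in \mathcal{M}_q(R_q)$ and hence $P_i \in \mathcal{P}_q(\sigma^2, R_q)$. Now fix

\[ \epsilon^2 = \frac{1}{480} \gamma\rho(1 - \log\rho) \wedge \frac{1}{2} . \]

If we can choose $\rho \in (0, 1]$ such that eq. (4.13) is satisfied, then by eq. (4.11),

\[
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge \max_i E_i\|\sin\Theta(\hat{\mathcal{A}}, \mathcal{A}_i)\|_F \ge \frac{\epsilon}{16\sqrt{2}} .
\]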

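Choose $\rho \in (0, 1]$ to be the unique solution of the equation

\[
\tag{4.14}
\rho =
\begin{cases}
T[\gamma(1 - \log\rho)]^{-q/2} , & \text{if } T < \gamma^{q/2} , \text{ and} \\
1 , & \text{otherwise.}
\end{cases}
\]

We will verify that $\epsilon$ and $\rho$ satisfy eq. (4.13). The assumption that $1 \le R_q - d$
guarantees that $\epsilon^{2q} \le (R_q - d)^2$, because $\epsilon^{2q} \le 1$. If $T < \gamma^{q/2}$, then

\[ (T/\rho)^2 \rho^q = [\gamma\rho(1 - \log\rho)]^q \ge \epsilon^{2q} . \]

If $T \ge \gamma^{q/2}$, then $\rho = 1$ and

\[ (T/\rho)^2 \rho^q = T^2 \ge \gamma^q \ge \epsilon^{2q} . \]

Thus, eq. (4.13) holds and so

\[
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge \frac{\epsilon}{16\sqrt{2}} \ge \frac{1}{496} [\gamma\rho(1 - \log\rho)]^{1/2} \wedge \frac{1}{32} .
\]

Now we substitute eq. (4.14) and the definitions of $\gamma$ and $T$ into the above
inequality to get the following lower bounds. If $T < \gamma^{q/2}$, then

\begin{align*}
\gamma\rho(1 - \log\rho) &= T\gamma^{1-\frac{q}{2}} \{1 - \log\rho\}^{1-\frac{q}{2}} \\
&= T\gamma^{1-\frac{q}{2}} \left\{ 1 - \log(T/\gamma^{q/2}) + \frac{q}{2}\log(1 - \log\rho) \right\}^{1-\frac{q}{2}} \tag{4.15} \\
&\ge T\gamma^{1-\frac{q}{2}} \left\{ 1 - \log(T/\gamma^{q/2}) \right\}^{1-\frac{q}{2}}
\end{align*}

and so

\[
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F
\ge c_0 \left\{ (R_q - d) \left[ \frac{\sigma^2}{n} \left( 1 - \log(T/\gamma^{q/2}) \right) \right]^{1-\frac{q}{2}} \wedge 1 \right\}^{1/2} .
\]

If $T \ge \gamma^{q/2}$, then $\gamma\rho(1 - \log\rho) = \gamma$ and

\[
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c_0 (\gamma \wedge 1)^{1/2} = c_0 \left\{ \frac{(p - d)\sigma^2}{n} \wedge 1 \right\}^{1/2} .
\]

□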
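Equation (4.14) defines $\rho$ only implicitly. The following sketch shows one way to compute it, assuming the equation reads $\rho = T[\gamma(1-\log\rho)]^{-q/2}$ when $T < \gamma^{q/2}$; it uses the equivalent form $\rho(1-\log\rho)^{q/2} = T\gamma^{-q/2}$, whose left-hand side is increasing on $(0,1]$ for $q < 2$, so bisection applies. The function name `solve_rho` and the tolerance are illustrative choices, not part of the paper.

```python
import math

def solve_rho(T, gamma, q, tol=1e-12):
    """Solve rho = T * (gamma * (1 - log rho))**(-q/2) on (0, 1]; return 1 if T >= gamma**(q/2)."""
    target = T / gamma ** (q / 2.0)
    if target >= 1.0:
        return 1.0
    # h is increasing on (0, 1] with h(0+) = 0 and h(1) = 1 (for 0 <= q < 2).
    h = lambda r: r * (1.0 - math.log(r)) ** (q / 2.0)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

rho = solve_rho(T=0.01, gamma=0.5, q=1.0)
# Both sides of the identity (T / rho)^2 rho^q = [gamma rho (1 - log rho)]^q used in the proof:
print((0.01 / rho) ** 2 * rho, 0.5 * rho * (1.0 - math.log(rho)))
```

The two printed numbers should agree up to the bisection tolerance, which is the identity used to verify eq. (4.13).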



Proof of Theorem 4.2. For a fixed subset of $s$ variables, the challenge
in estimating the principal subspace of these variables is captured by
the richness of packing sets in the Stiefel manifold $\mathbb{V}_{s,d}$. A packing set in
the Stiefel manifold can be constructed from a packing set in the Grassmann
manifold by choosing a single element of the Stiefel manifold as a representative
for each element of the packing set in the Grassmann manifold. This is
well-defined, because the subspace distance is invariant to the choice of basis.
The following lemma specializes known results (Pajor, 1998, Proposition
8) for packing sets in the Grassmann manifold.

Lemma 4.6 (see Pajor (1998)). Let $k$ and $s$ be integers satisfying $1 \le
k \le s - k$, and let $\delta > 0$. There exists a subset $\{J_1, \ldots, J_N\} \subseteq \mathbb{V}_{s,k}$ satisfying
the following properties:

1. $\|\sin\Theta(J_i, J_j)\|_F \ge \sqrt{k}\,\delta$ for all $i \ne j$, and

2. $\log N \ge k(s - k)\log(c_2/\delta)$, where $c_2 > 0$ is an absolute constant.

To apply this result to Lemma 4.4 we will use Proposition 2.2 to convert
the lower bound on the subspace distance into a lower bound on the
Frobenius distance between orthonormal matrices. Thus,

\[ \tag{4.16} \|J_i - J_j\|_F \ge \|\sin\Theta(J_i, J_j)\|_F \ge \sqrt{k}\,\delta . \]

Let $\rho \in (0, 1]$ and $s = \max\{2d, \lfloor (p - d)\rho \rfloor\}$. Invoke Lemma 4.6 with $k = d$
and $\delta = c_2/e$, where $c_2 > 0$ is the constant given by Lemma 4.6. Let
$\{J_1, \ldots, J_N\} \subseteq \mathbb{V}_{p-d,d}$ be the subset given by Lemma 4.6 after augmenting
with rows of zeroes if necessary. Then

\[ \log N \ge d(s - d) \ge \max\{ d(s/2), d^2 \} \ge \max\{ (d/4)(p - d)\rho , d^2 \} \]

and by eq. (4.16),

\[ \|J_i - J_j\|_F^2 \ge d(c_2/e)^2 \]

for all $i \ne j$. The rest of this proof mirrors that of Theorem 4.1. Let $\epsilon \in
[0, 1/\sqrt{2}]$ and apply Lemma 4.4 to get

\begin{align*}
\max_i E_i \|\sin\Theta(\hat{\mathcal{A}}, \mathcal{A}_i)\|_F
&\ge \frac{c_2\sqrt{d}\,\epsilon}{2\sqrt{2}\,e} \left[ 1 - \frac{4nd\epsilon^2/\sigma^2}{(d/4)(p - d)\rho} - \frac{\log 2}{d^2} \right] \\
&\ge c_1\sqrt{d}\,\epsilon \left[ \frac{1}{2} - \frac{16\epsilon^2}{\gamma\rho} \right] , \tag{4.17}
\end{align*}

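where $\gamma$ is defined in eq. (3.1) and we used the assumption that $d \ge 2$. Since
$J_i \in \mathbb{V}_{p-d,d}$, Proposition 4.1 implies

\[
\|A_\epsilon(J_i)\|_{2,q}^q \le
\begin{cases}
d + s , & \text{if } q = 0 , \text{ and} \\
d + d^{\frac{q}{2}} \epsilon^q s^{\frac{2-q}{2}} , & \text{if } 0 < q < 2 .
\end{cases}
\]

For every $q \in (0, 2]$,

\[
d + d^{\frac{q}{2}} \epsilon^q s^{\frac{2-q}{2}} \le R_q \iff d^q \epsilon^{2q} \le \frac{(R_q - d)^2}{s^{2-q}} ,
\qquad\text{and}\qquad
\frac{(R_q - d)^2}{s^{2-q}} \ge \frac{(R_q - d)^2}{\max\{2d, (p - d)\rho\}^{2-q}} .
\]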



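So $\epsilon$ and $\rho$ must satisfy the constraint

\[
\tag{4.18}
d^q \epsilon^{2q} \le \min\left\{ (T/\rho)^2 \rho^q , \frac{(R_q - d)^2}{(2d)^{2-q}} \right\}
\]

to ensure that $P_i \in \mathcal{P}_q(\sigma^2, R_q)$. Fix

\[ \tag{4.19} \epsilon^2 = \frac{1}{64} \gamma\rho \wedge \frac{1}{2} \]

and

\[
\tag{4.20}
\rho =
\begin{cases}
T(d\gamma)^{-q/2} , & \text{if } T < (d\gamma)^{q/2} , \text{ and} \\
1 , & \text{otherwise.}
\end{cases}
\]

Since $\epsilon^2 \le 1/2$,

\[
d^q \epsilon^{2q} \le \frac{(R_q - d)^2}{(2d)^{2-q}}
\;\Longleftarrow\;
2^q \epsilon^{2q} \le \frac{(R_q - d)^2}{4d^2}
\;\Longleftarrow\;
2d \le R_q - d ,
\]

where the right-hand side is an assumption of the lemma. That verifies one
of the inequalities in eq. (4.18). If $T < (d\gamma)^{q/2}$, then

\[ (T/\rho)^2 \rho^q = (d\gamma)^q \rho^q \ge d^q \epsilon^{2q} . \]

If $T \ge (d\gamma)^{q/2}$, then $\rho = 1$ and

\[ (T/\rho)^2 \rho^q = T^2 \ge (d\gamma)^q \ge d^q \epsilon^{2q} . \]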

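Thus, eq. (4.18) holds and by eq. (4.17),

\begin{align*}
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F
&\ge \max_i E_i\|\sin\Theta(\hat{\mathcal{A}}, \mathcal{A}_i)\|_F \\
&\ge c_1\sqrt{d}\,\epsilon \left[ \frac{1}{2} - \frac{16\epsilon^2}{\gamma\rho} \right] \\
&\ge \frac{c_1}{4}\sqrt{d}\,\epsilon \\
&\ge c_0 (d\gamma\rho \wedge d)^{1/2} .
\end{align*}

Finally, we substitute the definition of $\gamma$ and eq. (4.20) into the above
inequality to get the following lower bounds. If $T < (d\gamma)^{q/2}$, then

\[
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F
\ge c_0 \left\{ T(d\gamma)^{1-\frac{q}{2}} \wedge d \right\}^{1/2}
= c_0 \left\{ (R_q - d) \left( \frac{d\sigma^2}{n} \right)^{1-\frac{q}{2}} \wedge d \right\}^{1/2} .
\]

If $T \ge (d\gamma)^{q/2}$, then

\[
\sup_{\mathcal{P}_q(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F
\ge c_0 (d\gamma \wedge d)^{1/2}
= c_0 \left\{ \frac{d(p - d)\sigma^2}{n} \wedge d \right\}^{1/2} .
\]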

□

Proof of Theorem 4.3. The proof is a modification of the proof of
Theorem 4.1. The difficulty of the problem is captured by the difficulty of
variable selection within each column of $V$. Instead of using a single
hypercube construction as in the proof of Theorem 4.1, we apply a hypercube
construction on each of the $d$ columns. We do this by dividing the $(p - d) \times d$
matrix into $d$ submatrices of size $\lfloor (p - d)/d \rfloor \times d$, i.e. constructing matrices
of the form

\[ \begin{bmatrix} B_1^T & B_2^T & \cdots & B_d^T & 0 \end{bmatrix}^T \]

and confining the hypercube construction to the $k$th column of each
$\lfloor (p - d)/d \rfloor \times d$ matrix $B_k$, $k = 1, \ldots, d$. This ensures that the resulting $(p - d) \times d$
matrix has orthonormal columns with disjoint supports.

Let $\rho \in (0,1]$ and $s = \max\{1, \lfloor (p - d)/d \rfloor \rho\}$. Applying Lemma 4.5 with
$m = \lfloor (p - d)/d \rfloor$, we obtain a subset $\{J_1, \ldots, J_M\} \subseteq \mathbb{V}_{m,1}$ such that

1. $\|J_i\|_0 \le s$ for all $i$,

2. $\|J_i - J_j\|_2^2 \ge 1/4$ for all $i \ne j$, and

3. $\log M \ge \max\{ c s (1 + \log(m/s)) , \log m \}$, where $c > 1/30$ is an absolute
constant.

Next we will combine the elements of this packing set in $\mathbb{V}_{m,1}$ to form a packing
set in $\mathbb{V}_{p-d,d}$. A naive approach takes the $d$-fold product $\{J_1, \ldots, J_M\}^d$;
however, this results in too small a packing distance because two elements
of this product set may differ in only one column.

We can increase the packing distance by requiring a substantial number
of columns to be different between any two elements of our packing set
without much sacrifice in the size of the final packing set. This is achieved
by applying an additional combinatorial round with the Gilbert-Varshamov
bound on $M$-ary codes of length $d$ with minimum Hamming distance $d/2$
(Gilbert, 1952; Varshamov, 1957). The $k$th coordinate of each code specifies
which element of $\{J_1, \ldots, J_M\}$ to place in the $k$th column of $B_k$, and so any
two elements of the resulting packing set will differ in at least $d/2$ columns.
Denote the resulting subset of $\mathbb{V}_{p-d,d}$ by $\mathcal{H}_s$. We have

1. $\|H\|_{*,0} \le s$ for all $H \in \mathcal{H}_s$.

2. $\|H_1 - H_2\|_F^2 \ge d/8$ for all $H_1, H_2 \in \mathcal{H}_s$ such that $H_1 \ne H_2$.

3. $\log N := \log|\mathcal{H}_s| \ge \max\{ c d s (1 + \log(m/s)) , \log m \}$, where $c > 0$ is an
absolute constant.

Note that the lower bound of $\log m$ in the 3rd item arises by considering
the packing set whose $N$ elements consist of matrices whose columns in
$B_1, \ldots, B_d$ are all equal to some $J_i$ for $i = 1, \ldots, M$. This ensures that
$\log N \ge \log M \ge \log m$. From here, the proof is a straightforward modification
of the proof of Theorem 4.1 with the substitution of $p - d$ by $(p - d)/d$. For
brevity we will only outline the major steps.
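The combinatorial round described above can be illustrated on a toy scale. The sketch below is not the construction used in the proof: it builds a maximal $M$-ary code greedily (any maximal code meets the Gilbert-Varshamov bound), and, as a stand-in for the hypercube packing of Lemma 4.5, it uses standard basis vectors for $J_1, \ldots, J_M$. The names `greedy_code`, `ball_volume` and `build_matrix` are hypothetical.

```python
from itertools import product
from math import comb
import numpy as np

def hamming(u, v):
    """Number of coordinates in which two words differ."""
    return sum(a != b for a, b in zip(u, v))

def greedy_code(M, d, min_dist):
    """Scan all M-ary words of length d, keeping those at distance >= min_dist from all kept words."""
    code = []
    for word in product(range(M), repeat=d):
        if all(hamming(word, c) >= min_dist for c in code):
            code.append(word)
    return code

def ball_volume(M, d, r):
    """Number of M-ary words of length d within Hamming distance r of a fixed word."""
    return sum(comb(d, j) * (M - 1) ** j for j in range(r + 1))

def build_matrix(word, J):
    """Place J[word[k]] in column k and row block k, giving a (d*m) x d matrix with disjoint column supports."""
    d, m = len(word), J.shape[1]
    H = np.zeros((d * m, d))
    for k, idx in enumerate(word):
        H[k * m:(k + 1) * m, k] = J[idx]
    return H

M, d = 3, 4
J = np.eye(M)                          # toy packing: unit vectors e_1, ..., e_M
code = greedy_code(M, d, d // 2)       # minimum Hamming distance d/2
packing = [build_matrix(w, J) for w in code]
print(len(code), M ** d / ball_volume(M, d, d // 2 - 1))  # code size vs. Gilbert-Varshamov bound
```

Two matrices built from codewords at Hamming distance at least $d/2$ differ in at least $d/2$ columns, so their squared Frobenius distance is at least $d/2$ times the squared packing distance of the $J_i$.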

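Recall the definitions of $T_*$ and $\gamma$ in eq. (3.2). Apply Lemma 4.4 with the
subset $\mathcal{H}_s$, $k = d$, $\delta_N = \sqrt{d}/\sqrt{8}$, and $b$ chosen so that $(1 + b)/b^2 = \sigma^2$. Then

\begin{align*}
\max_i E_i\|\sin\Theta(\hat{\mathcal{A}}, \mathcal{A}_i)\|_F
&\ge c_0\sqrt{d}\,\epsilon \left[ 1 - \frac{4n\epsilon^2/\sigma^2}{c m \rho(1 - \log\rho)} - \frac{\log 2}{\log m} \right] \\
&\ge c_0\sqrt{d}\,\epsilon \left[ \frac{1}{2} - \frac{(8/c) d \epsilon^2}{\gamma\rho(1 - \log\rho)} \right]
\end{align*}

by the assumption that $(p - d)/d \ge 4$, and

\[
\|A_\epsilon(H)\|_{*,q}^q \le
\begin{cases}
1 + s , & \text{if } q = 0 , \text{ and} \\
1 + \epsilon^q s^{\frac{2-q}{2}} , & \text{if } 0 < q < 2 .
\end{cases}
\]

The constraint

\[ d^q \epsilon^{2q} \le \min\{ (T_*/\rho)^2 \rho^q , d^q (R_q - 1)^2 \} \]

ensures that $P_i \in \mathcal{P}_q^*(\sigma^2, R_q)$. It is satisfied by choosing $\epsilon$ so that

\[ d\epsilon^2 = c_1 \gamma\rho(1 - \log\rho) \wedge \frac{d}{2} , \]

where $c_1 > 0$ is a sufficiently small constant, invoking the assumption that $d \le
d(R_q - 1)$, and letting $\rho$ be the unique solution of the equation

\[
\rho =
\begin{cases}
T_*[\gamma(1 - \log\rho)]^{-q/2} , & \text{if } T_* < \gamma^{q/2} , \text{ and} \\
1 , & \text{otherwise.}
\end{cases}
\]

We conclude that every estimator $\hat{V}$ satisfies

\[
\sup_{\mathcal{P}_q^*(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c_2 \left\{ \gamma\rho(1 - \log\rho) \wedge d \right\}^{1/2}
\]

and we have the following explicit lower bounds. If $T_* < \gamma^{q/2}$, then

\[
\sup_{\mathcal{P}_q^*(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F
\ge c_3 \left\{ d(R_q - 1) \left[ \frac{\sigma^2}{n} \left( 1 - \log(T_*/\gamma^{q/2}) \right) \right]^{1-\frac{q}{2}} \wedge d \right\}^{1/2} .
\]

If $T_* \ge \gamma^{q/2}$, then

\[
\sup_{\mathcal{P}_q^*(\sigma^2, R_q)} E\|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F \ge c_3 \left\{ \frac{(p - d)\sigma^2}{n} \wedge d \right\}^{1/2} .
\]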

5. Upper bound proofs. 

5.1. A variational approach to the perturbation of spectral subspaces. The
following result allows us to bound the curvature of the matrix functional
$F \mapsto \langle A, F \rangle$ at its point of maximum on the Grassmann manifold.

Lemma 5.1 (Curvature Lemma). Let $A$ be a $p \times p$ positive semidefinite
matrix and suppose that its eigenvalues $\lambda_1(A) \ge \cdots \ge \lambda_p(A)$ satisfy $\lambda_d(A) >
\lambda_{d+1}(A)$ for $d < p$. Let $\mathcal{E}$ be the $d$-dimensional subspace spanned by the
eigenvectors of $A$ corresponding to its $d$ largest eigenvalues, and let $E$ denote
its orthogonal projection. If $\mathcal{F}$ is a $d$-dimensional subspace of $\mathbb{R}^p$ and $F$ is
its orthogonal projection, then

\[ \|\sin\Theta(\mathcal{E}, \mathcal{F})\|_F^2 \le \frac{\langle A, E - F \rangle}{\lambda_d(A) - \lambda_{d+1}(A)} . \]
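The inequality of Lemma 5.1 is easy to probe numerically. The sketch below draws a random positive semidefinite $A$ and a random rank-$d$ projection $F$ and compares the two sides, using $\|\sin\Theta(\mathcal{E},\mathcal{F})\|_F^2 = \frac{1}{2}\|E - F\|_F^2$; the function names are illustrative only.

```python
import numpy as np

def leading_projection(A, d):
    """Projection onto the span of the d leading eigenvectors of symmetric A, and the gap lambda_d - lambda_{d+1}."""
    vals, vecs = np.linalg.eigh(A)      # eigenvalues in ascending order
    V = vecs[:, -d:]
    return V @ V.T, vals[-d] - vals[-d - 1]

def curvature_check(A, F, d):
    """Return (||sin Theta(E, F)||_F^2, <A, E - F> / gap) for a rank-d orthogonal projection F."""
    E, gap = leading_projection(A, d)
    sin_sq = 0.5 * np.linalg.norm(E - F, "fro") ** 2
    return sin_sq, np.sum(A * (E - F)) / gap

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 6))
A = G @ G.T                                        # positive semidefinite
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # random 2-dimensional subspace
print(curvature_check(A, Q @ Q.T, d=2))            # first entry should not exceed the second
```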

Using this lemma we have the following alternative to the traditional
matrix perturbation approach to bounding subspace distances using the
Davis-Kahan $\sin\Theta$ Theorem and Weyl's Inequality.

Lemma 5.2 (Variational $\sin\Theta$). In addition to the hypotheses of Lemma 5.1,
if $F$ satisfies

\[ \tag{5.1} \langle B, E \rangle - g(E) \le \langle B, F \rangle - g(F) \]

for some function $g : \mathbb{R}^{p \times p} \to \mathbb{R}$, then

\[
\tag{5.2}
\|\sin\Theta(\mathcal{E}, \mathcal{F})\|_F^2 \le \frac{\langle B - A, F - E \rangle - [g(F) - g(E)]}{\lambda_d(A) - \lambda_{d+1}(A)} .
\]

The lemma is different from the Davis-Kahan $\sin\Theta$ theorem because the
orthogonal projection $F$ does not have to correspond to a subspace spanned
by eigenvectors of $B$. $F$ only has to satisfy

\[ \langle B, E \rangle - g(E) \le \langle B, F \rangle - g(F) . \]

This condition is ideally suited for analyzing solutions of regularized and/or
constrained maximization problems where $E$ and $F$ are feasible, but $F$ is
optimal. Both Lemma 5.1 and Lemma 5.2 are proved in the appendix.

5.2. Proofs of the main upper bounds. $\Sigma$ and $S_n$ are both invariant under
translations of $\mu$. Since our estimators depend on $X_1, \ldots, X_n$ only
through $S_n$, we will assume without loss of generality that $\mu = 0$ for the
remainder of the paper. The sample covariance matrix can be written as

\[
S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T = \frac{1}{n}\sum_{i=1}^n X_i X_i^T - \bar{X}\bar{X}^T .
\]

It can be shown that $\bar{X}\bar{X}^T$ is a higher order term that is negligible (see the
proofs in Vu and Lei, 2012a, for an example of such arguments). Therefore,
we will ignore this term and focus on the dominating $\frac{1}{n}\sum_{i=1}^n X_i X_i^T$ term in
our proofs below.
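The decomposition of the sample covariance into a second-moment term and a mean term can be confirmed directly. The following sketch is only an illustration with simulated data; `sample_covariance_terms` is a hypothetical helper name.

```python
import numpy as np

def sample_covariance_terms(X):
    """Return ((1/n) sum_i X_i X_i^T, Xbar Xbar^T) for an n x p data matrix X."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    return X.T @ X / n, np.outer(xbar, xbar)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                  # mean-zero data, as assumed after centering
second_moment, mean_term = sample_covariance_terms(X)
S_n = np.cov(X, rowvar=False, bias=True)           # covariance with 1/n normalization
print(np.allclose(S_n, second_moment - mean_term)) # the identity in the display above
print(np.linalg.norm(mean_term, 2), np.linalg.norm(second_moment, 2))
```

With mean-zero data the spectral norm of the mean term is typically of order $p/n$, which is the sense in which it is a higher order term.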

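Proof of Theorem 3.4. We apply Lemma 5.2 taking $A = \Sigma$, $B = S_n$,
$E = VV^T$, and $F = \hat{V}\hat{V}^T$, where $\hat{V}$ is a solution of eq. (2.8). Since $VV^T$
and $\hat{V}\hat{V}^T$ are feasible and $\hat{V}\hat{V}^T$ is optimal,

\[ \langle S_n, VV^T \rangle \le \langle S_n, \hat{V}\hat{V}^T \rangle . \]

Thus,

\[
\hat\epsilon^2 := \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \le \frac{\langle S_n - \Sigma, \hat{V}\hat{V}^T - VV^T \rangle}{\lambda_d - \lambda_{d+1}}
\]

and

\[
\tag{5.3}
\hat\epsilon^2 \le \frac{\sqrt{2}}{\lambda_d - \lambda_{d+1}} \left\langle S_n - \Sigma, \frac{\hat{V}\hat{V}^T - VV^T}{\|\hat{V}\hat{V}^T - VV^T\|_F} \right\rangle \hat\epsilon ,
\]

because $\|\hat{V}\hat{V}^T - VV^T\|_F^2 = 2\hat\epsilon^2$ by eq. (2.6). Let

\[ \Delta = \frac{\hat{V}\hat{V}^T - VV^T}{\|\hat{V}\hat{V}^T - VV^T\|_F} . \]

Then $\|\Delta\|_{2,0} \le 2R_0$, $\|\Delta\|_F = 1$, and $\Delta$ has at most $d$ positive eigenvalues
and at most $d$ negative eigenvalues (see Proposition 2.1). Therefore, we can
write $\Delta = AA^T - BB^T$ where $\|A\|_{2,0} \le 2R_0$, $\|A\|_F \le 1$, $A \in \mathbb{R}^{p \times d}$, and the
same holds for $B$. Let

\[ \mathcal{U}(R_0) = \{ U \in \mathbb{R}^{p \times d} : \|U\|_{2,0} \le 2R_0 \text{ and } \|U\|_F \le 1 \} . \]

Equation (5.3) implies

\[ E\hat\epsilon \le \frac{2\sqrt{2}}{\lambda_d - \lambda_{d+1}} E \sup_{U \in \mathcal{U}(R_0)} |\langle S_n - \Sigma, UU^T \rangle| . \]

The empirical process $\langle S_n - \Sigma, UU^T \rangle$ indexed by $U$ is a generalized quadratic
form, and a sharp bound of its supremum involves some recent advances in
empirical process theory due to Mendelson (2010) and extensions of his
results. By Corollary 4.1 of Vu and Lei (2012b), we have

\[
E \sup_{U \in \mathcal{U}(R_0)} |\langle S_n - \Sigma, UU^T \rangle|
\le c\lambda_1 \left\{ \frac{E \sup_{U \in \mathcal{U}(R_0)} \langle Z, U \rangle}{\sqrt{n}} + \left( \frac{E \sup_{U \in \mathcal{U}(R_0)} \langle Z, U \rangle}{\sqrt{n}} \right)^2 \right\} ,
\]

where $Z$ is a $p \times d$ matrix of i.i.d. standard Gaussian variables. To control
$E \sup_{U \in \mathcal{U}(R_0)} \langle Z, U \rangle$, note that

\[ \langle Z, U \rangle \le \|Z\|_{2,\infty} \|U\|_{2,1} \le \|Z\|_{2,\infty} \sqrt{2R_0} , \]

because $U \in \mathcal{U}(R_0)$. Using a standard $\delta$-net argument (see Propositions B.1
and B.2), we have, when $p > 5$,

\[ \tag{5.4} \big\| \|Z\|_{2,\infty} \big\|_{\psi_2} \le 4.15\sqrt{d + \log p} , \]

and hence

\[ E \sup_{U \in \mathcal{U}(R_0)} \langle Z, U \rangle \le 6\sqrt{R_0(d + \log p)} . \]

The proof is complete since we assume that $6\sqrt{R_0(d + \log p)} \le \sqrt{n}$. □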



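Proof of Theorem 3.5. Again, we start from Lemma 5.2, which gives

\[
\hat\epsilon^2 := \|\sin\Theta(\hat{\mathcal{S}}, \mathcal{S})\|_F^2 \le \frac{\langle S_n - \Sigma, \hat{V}\hat{V}^T - VV^T \rangle}{\lambda_d - \lambda_{d+1}} .
\]

To get the correct dependence on $\lambda_i$ and for general values of $q$, we need a
more refined analysis to control the random variable $\langle S_n - \Sigma, \hat{V}\hat{V}^T - VV^T \rangle$.
Let

\[ W := S_n - \Sigma , \quad \Pi := VV^T , \quad\text{and}\quad \hat\Pi := \hat{V}\hat{V}^T . \]

For any projection matrix $\Pi$ we write $\Pi^\perp := I - \Pi$, the projection onto the
orthogonal complement. By Proposition A.1 we have

\begin{align*}
\langle W, \hat\Pi - \Pi \rangle &= -\langle W, \Pi\hat\Pi^\perp\Pi \rangle + 2\langle W, \Pi^\perp\hat\Pi\Pi \rangle + \langle W, \Pi^\perp\hat\Pi\Pi^\perp \rangle \tag{5.5} \\
&=: -T_1 + 2T_2 + T_3 . \tag{5.6}
\end{align*}

We will control $T_1$ (the upper-quadratic term), $T_2$ (the cross-product term),
and $T_3$ (the lower-quadratic term) separately.

Controlling $T_1$.

\begin{align*}
T_1 = \langle W, \Pi\hat\Pi^\perp\Pi \rangle &= \langle \Pi W \Pi, \Pi\hat\Pi^\perp\Pi \rangle \tag{5.7} \\
&\le \|\Pi W \Pi\|_2 \|\Pi\hat\Pi^\perp\Pi\|_* = \|\Pi W \Pi\|_2 \|\Pi\hat\Pi^\perp\hat\Pi^\perp\Pi\|_* \\
&= \|\Pi W \Pi\|_2 \|\Pi\hat\Pi^\perp\|_F^2 \le \|\Pi W \Pi\|_2 \hat\epsilon^2 ,
\end{align*}

where $\|\cdot\|_*$ is the nuclear norm ($\ell_1$ norm of the singular values) and $\|\cdot\|_2$ is
the spectral norm (or operator norm). The same chain applied to $-W$ bounds $-T_1$
by the same quantity. By Lemma B.5, we have (recall that
we assume $\|Z\|_{\psi_2} \le 1$ and $\epsilon_n \le 1$ for simplicity),

\[ \tag{5.8} \big\| \|\Pi W \Pi\|_2 \big\|_{\psi_1} \le c_1 \lambda_1 \sqrt{d/n} , \]

where $c_1$ is a universal constant. Define

\[ \Omega_1 = \left\{ T_1 \ge c_1 \lambda_1 \sqrt{\frac{d}{n}} \log n \, \hat\epsilon^2 \right\} . \]

Then, when $n \ge 2$ we have

\[ \tag{5.9} P(\Omega_1) \le P\left( \|\Pi W \Pi\|_2 \ge c_1 \lambda_1 \log n \sqrt{d/n} \right) \le (n - 1)^{-1} . \]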
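The norm manipulations in eq. (5.7) rest on the fact that $\Pi\hat\Pi^\perp\Pi$ is positive semidefinite, so its nuclear norm equals its trace $\|\Pi\hat\Pi^\perp\|_F^2$, which in turn equals $\|\sin\Theta\|_F^2$. A minimal numerical sketch, with a random symmetric matrix standing in for $W = S_n - \Sigma$ (an assumption made purely for illustration):

```python
import numpy as np

def t1_terms(W, V, V_hat):
    """Return T_1 = <W, Pi Pihat^perp Pi> together with the norms that appear in eq. (5.7)."""
    P, P_hat = V @ V.T, V_hat @ V_hat.T
    P_hat_perp = np.eye(P.shape[0]) - P_hat
    M = P @ P_hat_perp @ P
    t1 = np.sum(W * M)                                    # trace inner product
    nuclear = np.linalg.norm(M, "nuc")                    # ||Pi Pihat^perp Pi||_*
    sin_sq = np.linalg.norm(P @ P_hat_perp, "fro") ** 2   # ||Pi Pihat^perp||_F^2
    spectral = np.linalg.norm(P @ W @ P, 2)               # ||Pi W Pi||_2
    return t1, nuclear, sin_sq, spectral

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((8, 3)))
V_hat, _ = np.linalg.qr(rng.standard_normal((8, 3)))
G = rng.standard_normal((8, 8))
W = (G + G.T) / 2                                         # symmetric stand-in for S_n - Sigma
t1, nuclear, sin_sq, spectral = t1_terms(W, V, V_hat)
print(t1, spectral * sin_sq, nuclear, sin_sq)
```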
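Controlling $T_2$.

\begin{align*}
T_2 = \langle W, \Pi^\perp\hat\Pi\Pi \rangle &= \langle \Pi^\perp W \Pi, \Pi^\perp\hat\Pi \rangle \tag{5.10} \\
&\le \|\Pi^\perp W \Pi\|_{2,\infty} \|\Pi^\perp\hat\Pi\|_{2,1} .
\end{align*}

To bound $\|\Pi^\perp\hat\Pi\|_{2,1}$, let the rows of $\Pi^\perp\hat\Pi$ be denoted by $\phi_1, \ldots, \phi_p$ and
$t > 0$. Using a standard argument of bounding the $\ell_1$ norm by the $\ell_q$ and $\ell_2$
norms (e.g., Raskutti, Wainwright and Yu, 2011, Lemma 5), we have for all
$t > 0$, $0 < q \le 1$,

\begin{align*}
\|\Pi^\perp\hat\Pi\|_{2,1} = \sum_{i=1}^p \|\phi_i\|_2
&\le \left[ \sum_{i=1}^p \|\phi_i\|_2^q \right]^{1/2} \left[ \sum_{i=1}^p \|\phi_i\|_2^2 \right]^{1/2} t^{-q/2} + \left[ \sum_{i=1}^p \|\phi_i\|_2^q \right] t^{1-q} \tag{5.11} \\
&= \|\Pi^\perp\hat\Pi\|_{2,q}^{q/2} \|\Pi^\perp\hat\Pi\|_F \, t^{-q/2} + \|\Pi^\perp\hat\Pi\|_{2,q}^q \, t^{1-q} \\
&\le \sqrt{2} R_q^{1/2} t^{-q/2} \hat\epsilon + 2R_q t^{1-q} ,
\end{align*}

where the last step uses the fact that

\begin{align*}
\|\Pi^\perp\hat\Pi\|_{2,q}^q = \|\Pi^\perp\hat{V}\|_{2,q}^q = \|\hat{V} - VV^T\hat{V}\|_{2,q}^q
&\le \|\hat{V}\|_{2,q}^q + \|VV^T\hat{V}\|_{2,q}^q \\
&\le \|\hat{V}\|_{2,q}^q + \|V\|_{2,q}^q \le 2R_q .
\end{align*}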

Combining eqs. (5.10) and (5.11) we obtain, for all $t > 0$, $0 < q \le 1$,

\[ \tag{5.12} T_2 \le \|\Pi^\perp W \Pi\|_{2,\infty} \left( \sqrt{2} R_q^{1/2} t^{-q/2} \hat\epsilon + 2R_q t^{1-q} \right) . \]
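The interpolation step behind eq. (5.11) holds for any nonnegative vector of row norms: entries above the threshold $t$ are handled by Cauchy-Schwarz and a counting bound, and entries below $t$ by $x \le t^{1-q} x^q$. The sketch below checks the resulting inequality on random inputs; `l1_interpolation_bound` is a hypothetical name.

```python
import numpy as np

def l1_interpolation_bound(row_norms, q, t):
    """Right-hand side of the l1 bound: (sum x^q)^(1/2) (sum x^2)^(1/2) t^(-q/2) + (sum x^q) t^(1-q)."""
    lq = np.sum(row_norms ** q)                 # sum of q-th powers (the "l_q^q" quantity)
    l2 = np.sqrt(np.sum(row_norms ** 2))        # Euclidean norm of the vector of row norms
    return np.sqrt(lq) * l2 * t ** (-q / 2.0) + lq * t ** (1.0 - q)

rng = np.random.default_rng(0)
row_norms = np.abs(rng.standard_normal(50)) * (rng.random(50) < 0.3)   # a sparse vector of row norms
for t in (0.1, 0.5, 2.0):
    print(t, row_norms.sum(), l1_interpolation_bound(row_norms, q=0.5, t=t))
```

The bound holds for every $t > 0$; the proof later chooses $t$ to balance the two terms.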

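The case where $q = 0$ is simpler and omitted. Now define

\begin{align*}
\Omega_2 &:= \Big\{ T_2 \ge 20 \Big( \sqrt{\lambda_1\lambda_{d+1}}^{\,1-q/2} (\lambda_d - \lambda_{d+1})^{q/2} \epsilon_n \hat\epsilon
+ \sqrt{\lambda_1\lambda_{d+1}}^{\,2-q} (\lambda_d - \lambda_{d+1})^{-(1-q)} \epsilon_n^2 \Big) \Big\} , \\
t_{2,1} &= 20\sqrt{\lambda_1\lambda_{d+1}} \sqrt{\frac{d + \log p}{n}} , \\
t_{2,2} &= \frac{\sqrt{\lambda_1\lambda_{d+1}}}{\lambda_d - \lambda_{d+1}} \sqrt{\frac{d + \log p}{n}} .
\end{align*}

Taking $t = t_{2,2}$ in eq. (5.12) and using the tail bound result in Lemma B.1,
we have

\begin{align*}
P(\Omega_2) &\le P\left( \|\Pi^\perp W \Pi\|_{2,\infty} \ge t_{2,1} \right) \tag{5.13} \\
&\le 2p5^d \exp\left( -\frac{t_{2,1}^2/8}{2\lambda_1\lambda_{d+1}/n + t_{2,1}\sqrt{\lambda_1\lambda_{d+1}}/n} \right) \\
&\le p^{-1} .
\end{align*}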

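Controlling $T_3$. The bound on $T_3$ involves a quadratic form empirical process
over a random set. Let $\epsilon \ge 0$ and define

\[
\phi(R_q, \epsilon) := \sup\{ \langle W, \Pi^\perp UU^T \Pi^\perp \rangle : U \in \mathbb{V}_{p,d}, \|U\|_{2,q}^q \le R_q, \|\Pi^\perp U\|_F \le \epsilon \} .
\]

Then by Lemma B.4, we have, with some universal constant $c_3$, for $x > 0$,

\[
P\left( \phi(R_q, \epsilon) \ge c_3 x \lambda_{d+1} (\epsilon_n\epsilon^2 + \epsilon_n^2\epsilon + \epsilon_n^4) \right) \le 2\exp(-x^{2/5}) .
\]

Let $T_3(U) = \langle W, \Pi^\perp UU^T \Pi^\perp \rangle$, for all $U \in \mathcal{U}_p(R_q)$, where

\[ \mathcal{U}_p(R_q) := \{ U \in \mathbb{V}_{p,d} : \operatorname{span}(U) \in \mathcal{M}_q(R_q) \} . \]

Define the function $g(\epsilon) = \epsilon_n\epsilon^2 + \epsilon_n^2\epsilon + \epsilon_n^4$. Then for all $\epsilon \ge 0$, we have $g(\epsilon) \ge
\epsilon_n^4 \ge 4d^3/n^2$. On the other hand, if $\epsilon = \|\sin(U, V)\|_F$, then $\epsilon^2 \le 2d$ and
hence $g(\epsilon) \le g(\sqrt{2d}) \le 2d + \sqrt{2d} + 1$. Let $\mu = \epsilon_n^4$ and $J = \lceil \log_2(g(\sqrt{2d})/\mu) \rceil$.
Then we have $J \le 3\log n + 6/5$.

Note that $g$ is strictly increasing on $[0, \sqrt{2d}]$. Then we have the following
peeling argument.

\begin{align*}
& P\left[ \exists\, U \in \mathcal{U}_p(R_q) : T_3(U) \ge 2c_3 (\log n)^{5/2} \lambda_{d+1}\, g(\|\sin(U, V)\|_F) \right] \\
&\quad\le P\Big[ \exists\, 1 \le j \le J, U \in \mathcal{U}_p(R_q) : 2^{j-1}\mu \le g(\|\sin(U, V)\|_F) \le 2^j\mu , \\
&\qquad\qquad T_3(U) \ge 2c_3 (\log n)^{5/2} \lambda_{d+1}\, g(\|\sin(U, V)\|_F) \Big] \\
&\quad\le \sum_{j=1}^J P\left[ \phi(R_q, g^{-1}(2^j\mu)) \ge c_3 (\log n)^{5/2} \lambda_{d+1} 2^j\mu \right] \\
&\quad\le J \cdot 2n^{-1} \le \frac{6\log n}{n} + \frac{3}{n} .
\end{align*}

Define

\[ \Omega_3 := \left\{ T_3 \ge c_3 (\log n)^{5/2} \lambda_{d+1} (\epsilon_n\hat\epsilon^2 + \epsilon_n^2\hat\epsilon + \epsilon_n^4) \right\} . \]

Then we have proved that

\[ P(\Omega_3) \le \frac{6\log n}{n} + \frac{3}{n} . \]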

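Put things together. Now recall the conditions in Equations (3.3) to (3.6).
On $\Omega_1^c \cap \Omega_2^c \cap \Omega_3^c$ we have, from eq. (5.5), that

\begin{align*}
(\lambda_d - \lambda_{d+1})\hat\epsilon^2
&\le \left( c_1 \sqrt{\frac{d}{n}} \log n \, \lambda_1 + c_3 \epsilon_n (\log n)^{5/2} \lambda_{d+1} \right) \hat\epsilon^2 \\
&\quad + 21 \sqrt{\lambda_1\lambda_{d+1}}^{\,1-q/2} (\lambda_d - \lambda_{d+1})^{q/2} \epsilon_n \hat\epsilon \\
&\quad + 21 \sqrt{\lambda_1\lambda_{d+1}}^{\,2-q} (\lambda_d - \lambda_{d+1})^{-(1-q)} \epsilon_n^2 ,
\end{align*}

so that

\begin{align*}
\frac{1}{2}(\lambda_d - \lambda_{d+1})\hat\epsilon^2
&\le 21 \sqrt{\lambda_1\lambda_{d+1}}^{\,1-q/2} (\lambda_d - \lambda_{d+1})^{q/2} \epsilon_n \hat\epsilon \\
&\quad + 21 \sqrt{\lambda_1\lambda_{d+1}}^{\,2-q} (\lambda_d - \lambda_{d+1})^{-(1-q)} \epsilon_n^2 .
\end{align*}

Dividing by $\lambda_d - \lambda_{d+1}$ and solving the resulting quadratic inequality in $\hat\epsilon$ gives

\[
\hat\epsilon \le 43 \left( \frac{\sqrt{\lambda_1\lambda_{d+1}}}{\lambda_d - \lambda_{d+1}} \right)^{1-q/2} \epsilon_n .
\]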

□

6. Discussion. We have derived non-asymptotic minimax upper and
lower bounds for principal subspace estimation over two classes of sparse
subspaces. In the row sparse case, our upper and lower bounds match up to
constants and are optimal in $(n, p, d, R_q, \sigma^2)$ in the sparse, high-dimensional
regime. In the column sparse case, our upper and lower bounds match up to
constants and are optimal in $(n, p, R_q, \sigma^2)$. We conjecture that the $d + \log p$
term that appears in the column sparse upper bound (Corollary 3.1) can
be improved to $1 + \log p$, and thus match the lower bound (Theorem 3.3).
It appears to us that the primary obstacle is tightening our analysis of a
cross-product term ($T_2$ in eq. (5.6)) that roughly corresponds to the
cross-covariance of the principal subspace and its orthocomplement. This is an
interesting technical challenge, but after all, deriving non-asymptotic bounds
that are optimal in all five parameters $(n, p, d, R_q, \sigma^2)$ seems too ambitious.

Interestingly, in the case $d = 1$ (where row and column sparsity coincide),
the form of the minimax optimal error for the principal subspace estimation
problem parallels that for the coefficient vector in the sparse linear
model (see Raskutti, Wainwright and Yu, 2011) with the noise-to-signal ratio
$\sigma^2$ playing the same role in the error as the noise variance in the linear
model. For $d > 1$, we suspect that this parallel relationship will continue to
hold with the multivariate (or multiple response) sparse linear model under
appropriate sparsity conditions. However, minimax rates have yet to be
established for that problem.

The nature of this work is theoretical and it leaves open many challenges
for methodology and practice. The minimax optimal estimators that we
present appear to be computationally intractable because they involve convex
maximization rather than convex minimization problems. Even in the
case $q = 1$, which corresponds to a subspace extension of $\ell_1$ constrained
PCA, the optimization problem remains challenging as there are no known
algorithms to efficiently compute a global maximum. Finally, although the
minimax optimal estimators that we propose do not require knowledge of
the noise-to-signal ratio $\sigma^2$, they do require knowledge of (or an upper bound
on) the sparsity $R_q$. It is not hard to modify our techniques to produce an
estimator that gives up adaptivity to $\sigma^2$ in exchange for adaptivity to $R_q$. One
could do this by using penalized versions of our estimators with a penalty
factor proportional to $\sigma^2$. An extension along this line has already been
considered by Lounici (2012) for the $d = 1$ case. A more interesting question is
whether or not there exist fully adaptive principal subspace estimators. In
other words, under what conditions can one find an estimator that achieves
the minimax optimal error without requiring knowledge of either $\sigma^2$ or $R_q$?

APPENDIX A: ADDITIONAL PROOFS

Proof of Proposition 2.2. Let $\gamma_i$ be the cosine of the $i$th canonical
angle between the subspaces spanned by $V_1$ and $V_2$. By Theorem II.4.11 of
Stewart and Sun (1990),

\[ \inf_Q \|V_1 - V_2 Q\|_F^2 = 2\sum_i (1 - \gamma_i) . \]

The inequalities

\[ 1 - x \le (1 - x^2) \le 2(1 - x) \]

hold for all $x \in [0,1]$. So

\[ \frac{1}{2} \inf_Q \|V_1 - V_2 Q\|_F^2 \le \sum_i (1 - \gamma_i^2) \le \inf_Q \|V_1 - V_2 Q\|_F^2 . \]

Apply the trigonometric identity $\sin^2\theta = 1 - \cos^2\theta$ to the preceding display
to conclude the proof. □
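The sandwich in the proof of Proposition 2.2 can be observed numerically. The sketch below computes the canonical cosines as the singular values of $V_1^T V_2$ and evaluates the infimum at the orthogonal polar factor of $V_2^T V_1$, which is the standard orthogonal Procrustes solution; the function names are illustrative.

```python
import numpy as np

def procrustes_sq(V1, V2):
    """inf over orthogonal Q of ||V1 - V2 Q||_F^2, attained at the polar factor of V2^T V1."""
    U, _, Wt = np.linalg.svd(V2.T @ V1)
    return np.linalg.norm(V1 - V2 @ (U @ Wt), "fro") ** 2

def sin_theta_sq(V1, V2):
    """Sum of squared sines of the canonical angles: sum_i (1 - gamma_i^2)."""
    cosines = np.linalg.svd(V1.T @ V2, compute_uv=False)
    return np.sum(1.0 - cosines ** 2)

rng = np.random.default_rng(0)
V1, _ = np.linalg.qr(rng.standard_normal((10, 3)))
V2, _ = np.linalg.qr(rng.standard_normal((10, 3)))
inf_sq = procrustes_sq(V1, V2)
print(0.5 * inf_sq, sin_theta_sq(V1, V2), inf_sq)   # should be nondecreasing left to right
```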

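A.1. Proofs related to the lower bounds.

Proof of Lemma 4.2. Write $\Sigma_i = \Sigma(A_i)$ for $i = 1, 2$. Since $\Sigma_1$ and $\Sigma_2$
are nonsingular and have the same determinant,

\begin{align*}
D(P_1 \| P_2) &= nD(\mathcal{N}(0, \Sigma_1) \| \mathcal{N}(0, \Sigma_2)) \\
&= \frac{n}{2} \left\{ \operatorname{Tr}(\Sigma_2^{-1}\Sigma_1) - p - \log\det(\Sigma_2^{-1}\Sigma_1) \right\}
= \frac{n}{2} \operatorname{Tr}(\Sigma_2^{-1}(\Sigma_1 - \Sigma_2)) .
\end{align*}

Now

\[ \Sigma_2^{-1} = (1 + b)^{-1} A_2 A_2^T + (I_p - A_2 A_2^T) \]

and

\[ \Sigma_1 - \Sigma_2 = b(A_1 A_1^T - A_2 A_2^T) . \]

Thus,

\begin{align*}
\operatorname{Tr}(\Sigma_2^{-1}(\Sigma_1 - \Sigma_2))
&= \frac{b}{1 + b} \left\{ (1 + b)\langle I_p - A_2 A_2^T, A_1 A_1^T \rangle - \langle A_2 A_2^T, A_2 A_2^T - A_1 A_1^T \rangle \right\} \\
&= \frac{b}{1 + b} \left\{ (1 + b)\langle I_p - A_2 A_2^T, A_1 A_1^T \rangle - \langle I_p, A_2 A_2^T - A_2 A_2^T A_1 A_1^T \rangle \right\} \\
&= \frac{b}{1 + b} \left\{ (1 + b)\langle I_p - A_2 A_2^T, A_1 A_1^T \rangle - \langle A_2 A_2^T, I_p - A_1 A_1^T \rangle \right\} \\
&= \frac{b^2}{1 + b} \|\sin\Theta(A_1, A_2)\|_F^2 ,
\end{align*}

by Proposition 2.1. □

Proof of Lemma 4.3. By Proposition 2.1 and the definition of $A_\epsilon(\cdot)$,

\begin{align*}
\|\sin\Theta(A_\epsilon(J_1), A_\epsilon(J_2))\|_F^2
&= \frac{1}{2} \|[A_\epsilon(J_1)][A_\epsilon(J_1)]^T - [A_\epsilon(J_2)][A_\epsilon(J_2)]^T\|_F^2 \\
&= \epsilon^2(1 - \epsilon^2)\|J_1 - J_2\|_F^2 + \frac{\epsilon^4}{2}\|J_1 J_1^T - J_2 J_2^T\|_F^2 \\
&\ge \epsilon^2(1 - \epsilon^2)\|J_1 - J_2\|_F^2 .
\end{align*}

The upper bound follows from Proposition 2.2:

\[ \|\sin\Theta(A_\epsilon(J_1), A_\epsilon(J_2))\|_F^2 \le \|A_\epsilon(J_1) - A_\epsilon(J_2)\|_F^2 = \epsilon^2 \|J_1 - J_2\|_F^2 . \qquad □ \]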
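The closed form obtained in the proof of Lemma 4.2 can be compared with a direct evaluation of the Gaussian Kullback-Leibler divergence. The sketch assumes the spiked form $\Sigma(A) = bAA^T + I_p$, which is what the expressions for $\Sigma_2^{-1}$ and $\Sigma_1 - \Sigma_2$ in that proof imply; `spiked_cov` and `gaussian_kl` are hypothetical helper names.

```python
import numpy as np

def spiked_cov(A, b):
    """Assumed form of Sigma(A): b * A A^T + I_p for a p x d orthonormal A."""
    return b * A @ A.T + np.eye(A.shape[0])

def gaussian_kl(S1, S2):
    """KL divergence D(N(0, S1) || N(0, S2)) between zero-mean Gaussians."""
    p = S1.shape[0]
    M = np.linalg.solve(S2, S1)              # S2^{-1} S1
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - p - logdet)

rng = np.random.default_rng(0)
A1, _ = np.linalg.qr(rng.standard_normal((8, 2)))
A2, _ = np.linalg.qr(rng.standard_normal((8, 2)))
b = 1.5
sin_sq = 0.5 * np.linalg.norm(A1 @ A1.T - A2 @ A2.T, "fro") ** 2
# Single-observation divergence; the n-fold product multiplies it by n.
print(gaussian_kl(spiked_cov(A1, b), spiked_cov(A2, b)), b**2 / (2 * (1 + b)) * sin_sq)
```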

Proof of Lemma 4.5. Let $s_0 = \lfloor \min(m/e, s) \rfloor$. The assumptions that
$m/e \ge 1$ and $s \ge 1$ guarantee that $s_0 \ge 1$. According to (Massart, 2007,
Lemma 4.10) (with $\alpha = 7/8$ and $\beta = 8/(7e)$), there exists a subset $\Omega_m^{s_0} \subseteq
\{0, 1\}^m$ satisfying the following properties:

1. $\|\omega\|_0 = s_0$ for all $\omega \in \Omega_m^{s_0}$,

2. $\|\omega - \omega'\|_0 > s_0/4$ for all distinct pairs $\omega, \omega' \in \Omega_m^{s_0}$, and

3. $\log|\Omega_m^{s_0}| \ge c s_0 \log(m/s_0)$, where $c \ge 0.251$.

Let

\[ \{J_1, \ldots, J_N\} := \{ s_0^{-1/2}\omega : \omega \in \Omega_m^{s_0} \} . \]

Clearly, $\{J_1, \ldots, J_N\} \subseteq \mathbb{V}_{m,1}$ and

\[ \|J_i\|_{(2,0)} = \|\omega_i\|_0 = s_0 \le s \]

for every $i$. If $i \ne j$, then

\[ \|J_i - J_j\|_2^2 = s_0^{-1}\|\omega_i - \omega_j\|_0 > 1/4 . \]

The cardinality of $\{J_1, \ldots, J_N\}$ satisfies

\[ \log N = \log|\Omega_m^{s_0}| \ge c s_0 \log(m/s_0) . \]

As a function of $s_0$, the above right-hand side is increasing on the interval
$[0, m/e]$. Since $\min(m/e, s)/2 \le s_0$ belongs to that interval,

\begin{align*}
\log N &\ge c(\min(m/e, s)/2) \log[m/(\min(m/e, s)/2)] \\
&\ge (c/2) \min(m/e, s) \log[m/\min(m/e, s)] .
\end{align*}

It is easy to see that

\[ \min(m/e, s) \log[m/\min(m/e, s)] \ge \max\{ s\log(m/s) , s/e \} \]

for all $s \in [1, m]$. Thus,

\[ \min(m/e, s) \log[m/\min(m/e, s)] \ge (1 + e)^{-1} s + (1 + e)^{-1} s\log(m/s) \]

and

\[ \tag{A.1} \log N \ge (c/2)(1 + e)^{-1} s (1 + \log(m/s)) , \]

where $(c/2)(1 + e)^{-1} > 1/30$. If the above right-hand side is $\le \log m$, then we
may repeat the entire argument from the beginning with $\{J_1, \ldots, J_N\}$ taken
to be the $N = m$ vectors $\{(1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)\} \subseteq
\{0, 1\}^m$. That yields, in combination with eq. (A.1),

\[ \log N \ge \max\{ (1/30) s [1 + \log(m/s)] , \log m \} . \qquad □ \]

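A.2. Proofs related to the upper bounds.

Proof of Lemma 5.1. For brevity, denote the eigenvalues of $A$ by $\lambda_i :=
\lambda_i(A)$. Let $A = \sum_{i=1}^p \lambda_i u_i u_i^T$ be the spectral decomposition of $A$ so that
$E = \sum_{i=1}^d u_i u_i^T$ and $E^\perp = \sum_{i=d+1}^p u_i u_i^T$. Then

\begin{align*}
\langle A, E - F \rangle &= \langle A, E(I - F) - (I - E)F \rangle \\
&= \langle EA, F^\perp \rangle - \langle E^\perp A, F \rangle \\
&= \sum_{i=1}^d \lambda_i \langle u_i u_i^T, F^\perp \rangle - \sum_{i=d+1}^p \lambda_i \langle u_i u_i^T, F \rangle \\
&\ge \lambda_d \sum_{i=1}^d \langle u_i u_i^T, F^\perp \rangle - \lambda_{d+1} \sum_{i=d+1}^p \langle u_i u_i^T, F \rangle \\
&= \lambda_d \langle E, F^\perp \rangle - \lambda_{d+1} \langle E^\perp, F \rangle .
\end{align*}

Since orthogonal projections are idempotent,

\begin{align*}
\lambda_d \langle E, F^\perp \rangle - \lambda_{d+1} \langle E^\perp, F \rangle
&= \lambda_d \langle EF^\perp, EF^\perp \rangle - \lambda_{d+1} \langle E^\perp F, E^\perp F \rangle \\
&= \lambda_d \|EF^\perp\|_F^2 - \lambda_{d+1} \|E^\perp F\|_F^2 .
\end{align*}

Now apply Proposition 2.1 to conclude that

\[ \lambda_d \|EF^\perp\|_F^2 - \lambda_{d+1} \|E^\perp F\|_F^2 = (\lambda_d - \lambda_{d+1}) \|\sin\Theta(\mathcal{E}, \mathcal{F})\|_F^2 . \qquad □ \]

Proof of Lemma 5.2. Equation (5.1) is equivalent to

\[ 0 \le \langle B, F - E \rangle - [g(F) - g(E)] . \]

Then by Lemma 5.1,

\begin{align*}
[\lambda_d(A) - \lambda_{d+1}(A)] \|\sin\Theta(\mathcal{E}, \mathcal{F})\|_F^2 &\le -\langle A, F - E \rangle \\
&\le \langle B - A, F - E \rangle - [g(F) - g(E)] . \qquad □
\end{align*}

Proposition A.1. If $W$ is symmetric, and $E$ and $F$ are orthogonal
projections, then

\[ \tag{A.2} \langle W, F - E \rangle = \langle E^\perp W E^\perp, F \rangle - \langle EWE, F^\perp \rangle + 2\langle E^\perp W E, F \rangle . \]

Proof. Using the expansion

\[ W = E^\perp W E^\perp + EWE + EWE^\perp + E^\perp W E \]

and the symmetry of $W$, $F$ and $E$, we can write

\begin{align*}
\langle W, F - E \rangle &= \langle E^\perp W E^\perp, F - E \rangle + \langle EWE, F - E \rangle + 2\langle E^\perp W E, F - E \rangle \\
&= \langle E^\perp W E^\perp, E^\perp(F - E) \rangle + \langle EWE, E(F - E) \rangle + 2\langle E^\perp W E, E^\perp(F - E) \rangle \\
&= \langle E^\perp W E^\perp, F \rangle + \langle EWE, E(F - E) \rangle + 2\langle E^\perp W E, F \rangle .
\end{align*}

Now note that

\[ E(F - E) = EF - E = -EF^\perp . \qquad □ \]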
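Identity (A.2) is purely algebraic, so it can be verified on random inputs. The sketch below draws a random symmetric $W$ and two random orthogonal projections of possibly different ranks; the helper names are illustrative.

```python
import numpy as np

def inner(A, B):
    """Trace inner product <A, B> = Tr(A^T B)."""
    return np.sum(A * B)

def decomposition_terms(W, E, F):
    """Return (<W, F - E>, right-hand side of identity (A.2))."""
    I = np.eye(W.shape[0])
    E_perp, F_perp = I - E, I - F
    rhs = (inner(E_perp @ W @ E_perp, F)
           - inner(E @ W @ E, F_perp)
           + 2.0 * inner(E_perp @ W @ E, F))
    return inner(W, F - E), rhs

def random_projection(p, d, rng):
    """Orthogonal projection onto a random d-dimensional subspace of R^p."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q @ Q.T

rng = np.random.default_rng(0)
G = rng.standard_normal((7, 7))
W = (G + G.T) / 2
print(decomposition_terms(W, random_projection(7, 2, rng), random_projection(7, 3, rng)))
```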

APPENDIX B: EMPIRICAL PROCESS RELATED PROOFS

B.1. The cross-product term. This section is dedicated to proving
the following bound on the cross-product term.

Lemma B.1. There exists a universal constant $c > 0$ such that

\[
P\left( \|\Pi^\perp (S_n - \Sigma) \Pi\|_{2,\infty} \ge t \right) \le 2p5^d \exp\left( -\frac{t^2/8}{2\lambda_1\lambda_{d+1}/n + t\sqrt{\lambda_1\lambda_{d+1}}/n} \right) .
\]

The proof of Lemma B.1 builds on the following two lemmas. They are
adapted from Lemmas 2.2.10 and 2.2.11 of van der Vaart and Wellner (1996).

Lemma B.2 (Bernstein's Inequality). Let $Y_1, \ldots, Y_n$ be independent random
variables with zero mean. Then

\[
P\left( \left| \sum_{i=1}^n Y_i \right| > t \right) \le 2\exp\left( -\frac{t^2/2}{2\sum_{i=1}^n \|Y_i\|_{\psi_1}^2 + t\max_{i \le n}\|Y_i\|_{\psi_1}} \right) .
\]

Lemma B.3 (Maximal Inequality). Let $Y_1, \ldots, Y_m$ be arbitrary random
variables that satisfy the bound

\[ P(|Y_i| > t) \le 2\exp\left( -\frac{t^2/2}{b + at} \right) \]

for all $t > 0$ (and $i$) and fixed $a, b > 0$. Then

\[
\left\| \max_{1 \le i \le m} Y_i \right\|_{\psi_1} \le c\left( a\log(1 + m) + \sqrt{b\log(1 + m)} \right)
\]

for a universal constant $c > 0$.

We bound $\|\Pi^\perp (S_n - \Sigma) \Pi\|_{2,\infty}$ by a standard $\delta$-net argument.

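Proposition B.1. Let $A$ be a $p \times d$ matrix, $(e_1, \ldots, e_p)$ be the canonical
basis of $\mathbb{R}^p$ and $\mathcal{N}_\delta$ be a $\delta$-net of $S^{d-1}$ for some $\delta \in [0,1)$. Then

\[ \|A\|_{2,\infty} \le (1 - \delta)^{-1} \max_{1 \le j \le p} \max_{u \in \mathcal{N}_\delta} \langle e_j, Au \rangle . \]

Proof. By duality and compactness, there exist $u_* \in S^{d-1}$ and $u \in \mathcal{N}_\delta$
such that

\[ \|A\|_{2,\infty} = \max_{1 \le j \le p} \|e_j^T A\|_2 = \max_{1 \le j \le p} \langle e_j, Au_* \rangle , \]

and $\|u_* - u\|_2 \le \delta$. Then by the Cauchy-Schwarz Inequality,

\begin{align*}
\|A\|_{2,\infty} &= \max_{1 \le j \le p} \left\{ \langle e_j, Au \rangle + \langle e_j, A(u_* - u) \rangle \right\} \\
&\le \max_{1 \le j \le p} \left\{ \langle e_j, Au \rangle + \delta\|e_j^T A\|_2 \right\} \\
&\le \max_{1 \le j \le p} \max_{u \in \mathcal{N}_\delta} \langle e_j, Au \rangle + \delta\|A\|_{2,\infty} .
\end{align*}

Thus,

\[ \|A\|_{2,\infty} \le (1 - \delta)^{-1} \max_{1 \le j \le p} \max_{u \in \mathcal{N}_\delta} \langle e_j, Au \rangle . \qquad □ \]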
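For $d = 2$ a $\delta$-net of the unit circle can be written down explicitly, which makes Proposition B.1 easy to test. The sketch below uses equally spaced points (an illustrative choice, not necessarily a minimal net) and compares the exact $(2,\infty)$ norm with the net-based bound; it also prints the cardinality bound $(1 + 2/\delta)^d$ quoted from Ledoux (2001) for comparison.

```python
import math
import numpy as np

def circle_net(delta):
    """Equally spaced points on S^1 such that every unit vector is within Euclidean distance delta of one."""
    # A unit vector is within angle pi/K of some net point; the chord 2 sin(pi / (2K)) must be <= delta.
    K = math.ceil(math.pi / (2.0 * math.asin(delta / 2.0)))
    angles = 2.0 * math.pi * np.arange(K) / K
    return np.column_stack([np.cos(angles), np.sin(angles)])

def net_bound(A, delta):
    """Return (||A||_{2,inf}, (1 - delta)^{-1} max_j max_{u in net} <e_j, A u>) for a p x 2 matrix A."""
    net = circle_net(delta)
    exact = np.max(np.linalg.norm(A, axis=1))       # largest row norm
    approx = np.max(A @ net.T) / (1.0 - delta)      # entry (j, u) of A @ net.T is <e_j, A u>
    return exact, approx

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
print(net_bound(A, delta=0.5))
print(len(circle_net(0.5)), (1 + 2 / 0.5) ** 2)     # net size vs. the bound (1 + 2/delta)^d with d = 2
```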



The following bound on the covering number of the sphere is well-known
(see, e.g., Ledoux, 2001, Lemma 3.18).

Proposition B.2. Let $\mathcal{N}_\delta$ be a minimal $\delta$-net of $S^{d-1}$ for $\delta \in (0,1)$.
Then

\[ |\mathcal{N}_\delta| \le (1 + 2/\delta)^d . \]

Proposition B.3. Let $X$ and $Y$ be random variables. Then

\[ \|XY\|_{\psi_1} \le \|X\|_{\psi_2} \|Y\|_{\psi_2} . \]

Proof. Let $A = X/\|X\|_{\psi_2}$ and $B = Y/\|Y\|_{\psi_2}$. Using the elementary inequality

\[ |ab| \le \frac{1}{2}(a^2 + b^2) \]

and the triangle inequality, we have that

\[ \|AB\|_{\psi_1} \le \frac{1}{2}\left( \|A^2\|_{\psi_1} + \|B^2\|_{\psi_1} \right) = \frac{1}{2}\left( \|A\|_{\psi_2}^2 + \|B\|_{\psi_2}^2 \right) = 1 . \]

Multiplying both sides of the inequality by $\|X\|_{\psi_2}\|Y\|_{\psi_2}$ gives the desired
result. □

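Proof of Lemma B.1. Let $\mathcal{N}_\delta$ be a minimal $\delta$-net in $S^{d-1}$ for some
$\delta \in (0, 1)$ to be chosen later. By Proposition B.1 we have

\[ \|\Pi^\perp W \Pi\|_{2,\infty} \le \frac{1}{1 - \delta} \max_{1 \le j \le p} \max_{u \in \mathcal{N}_\delta} \langle \Pi^\perp e_j, WVu \rangle , \]

where $e_j$ is the $j$th column of $I_{p \times p}$. Taking $\delta = 1/2$, by Proposition B.2 we
have $|\mathcal{N}_\delta| \le 5^d$.

Now $\Pi^\perp \Sigma V = 0$ and so

\[ \langle \Pi^\perp e_j, WVu \rangle = \frac{1}{n}\sum_{i=1}^n \langle X_i, \Pi^\perp e_j \rangle \langle X_i, Vu \rangle \]

is the sum of independent random variables with mean zero. By Proposition B.3,
the summands satisfy

\begin{align*}
\|\langle X_i, \Pi^\perp e_j \rangle \langle X_i, Vu \rangle\|_{\psi_1}
&\le \|\langle X_i, \Pi^\perp e_j \rangle\|_{\psi_2} \|\langle X_i, Vu \rangle\|_{\psi_2} \\
&= \|\langle Z_i, \Sigma^{1/2}\Pi^\perp e_j \rangle\|_{\psi_2} \|\langle Z_i, \Sigma^{1/2}Vu \rangle\|_{\psi_2} \\
&\le \|Z_1\|_{\psi_2}^2 \|\Sigma^{1/2}\Pi^\perp e_j\|_2 \|\Sigma^{1/2}Vu\|_2 \\
&\le \|Z_1\|_{\psi_2}^2 \sqrt{\lambda_1\lambda_{d+1}} .
\end{align*}

Recall that $\|Z_1\|_{\psi_2} = 1$. Then Bernstein's Inequality (Lemma B.2) implies
that for all $t > 0$ and every $u \in \mathcal{N}_\delta$

\begin{align*}
P\left( \|\Pi^\perp W \Pi\|_{2,\infty} > t \right)
&\le P\left( \max_{1 \le j \le p} \max_{u \in \mathcal{N}_\delta} \langle \Pi^\perp e_j, WVu \rangle > t/2 \right) \\
&\le p5^d P\left( |\langle \Pi^\perp e_j, WVu \rangle| > t/2 \right) \\
&\le 2p5^d \exp\left( -\frac{t^2/8}{2\lambda_1\lambda_{d+1}/n + t\sqrt{\lambda_1\lambda_{d+1}}/n} \right) . \qquad □
\end{align*}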

B.2. The quadratic terms. 

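Lemma B.4. Let $\epsilon \ge 0$, $q \in (0, 1]$, and

\[
\phi(R_q, \epsilon) = \sup\{ \langle S_n - \Sigma, \Pi^\perp UU^T \Pi^\perp \rangle : U \in \mathbb{V}_{p,d}, \|U\|_{2,q}^q \le R_q , \|\Pi^\perp U\|_F \le \epsilon \} .
\]

There exists a constant $c > 0$ such that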

ttj ; /- p w ll7 l|2 v \ E{Rg,e) E 2 (R q ,e) \ 

where 

\[ E(R_q, \epsilon) = E\sup\{ \langle Z, U \rangle : U \in \mathbb{R}^{(p-d) \times d}, \|U\|_{2,q}^q \le 2R_q, \|U\|_F \le \epsilon \} \]

and $Z$ is a $(p - d) \times d$ matrix with i.i.d. $\mathcal{N}(0, 1)$ entries. Moreover, we have,
for another numerical constant $c'$,

\[ \tag{B.1} \frac{E(R_q, \epsilon)}{\sqrt{n}} \le c'\left( R_q^{1/2} t^{1-q/2} \epsilon + R_q t^{2-q} \right) \]

with $t = \sqrt{(d + \log p)/n}$.

Proof. The first part follows from Corollary 4.1 of Vu and Lei (2012b).
It remains for us to prove the 'moreover' part. By the duality of the $(2, 1)$-
and $(2, \infty)$-norms,

\[ \langle Z, U \rangle \le \|Z\|_{2,\infty} \|U\|_{2,1} \]

and so

\[ E(R_q, \epsilon) \le E\|Z\|_{2,\infty} \sup\{ \|U\|_{2,1} : U \in \mathbb{R}^{(p-d) \times d}, \|U\|_{2,q}^q \le 2R_q, \|U\|_F \le \epsilon \} . \]

By eq. (5.4) and the fact that the Orlicz $\psi_2$-norm bounds the expectation,

\[ E\|Z\|_{2,\infty} \le c'\sqrt{d + \log p} . \]

Now $\|U\|_{2,1}$ is just the $\ell_1$ norm of the vector of row-wise norms of $U$. So we
use a standard argument to bound the $\ell_1$ norm in terms of the $\ell_2$ and $\ell_q$
norms for $q \in (0,1]$ (e.g., Raskutti, Wainwright and Yu, 2011, Lemma 5),
and find that for every $t > 0$

\begin{align*}
\|U\|_{2,1} &\le \|U\|_{2,q}^{q/2} \|U\|_{2,2} \, t^{-q/2} + \|U\|_{2,q}^q \, t^{1-q} \\
&= \|U\|_{2,q}^{q/2} \|U\|_F \, t^{-q/2} + \|U\|_{2,q}^q \, t^{1-q} .
\end{align*}

Thus,

\[
\sup\{ \|U\|_{2,1} : U \in \mathbb{R}^{(p-d) \times d}, \|U\|_{2,q}^q \le 2R_q, \|U\|_F \le \epsilon \} \le \sqrt{2} R_q^{1/2} t^{-q/2} \epsilon + 2R_q t^{1-q} .
\]

Letting $t = E\|Z\|_{2,\infty}/\sqrt{n}$ and combining the above inequalities completes
the proof. □

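Lemma B.5. There exists a constant $c > 0$ such that

\[ \big\| \|\Pi(S_n - \Sigma)\Pi\|_2 \big\|_{\psi_1} \le c\|Z_1\|_{\psi_2}^2 \lambda_1 \left( \sqrt{d/n} + d/n \right) . \]

Proof. Let $\mathcal{N}_\delta$ be a minimal $\delta$-net of $S^{d-1}$ for some $\delta \in (0, 1)$ to be
chosen later. Then

\[ \|\Pi(S_n - \Sigma)\Pi\|_2 = \|V^T(S_n - \Sigma)V\|_2 \le (1 - 2\delta)^{-1} \max_{u \in \mathcal{N}_\delta} |\langle Vu, (S_n - \Sigma)Vu \rangle| . \]

Using a similar argument as in the proof of Lemma B.1, for all $t > 0$ and
every $u \in \mathcal{N}_\delta$

\[ P\left( |\langle Vu, (S_n - \Sigma)Vu \rangle| > t \right) \le 2\exp\left( -\frac{t^2/2}{2a^2/n + ta/n} \right) , \]

where $a = 2\|Z_1\|_{\psi_2}^2 \lambda_1$. Then Lemma B.3 implies that

\begin{align*}
\big\| \|\Pi(S_n - \Sigma)\Pi\|_2 \big\|_{\psi_1}
&\le (1 - 2\delta)^{-1} \left\| \max_{u \in \mathcal{N}_\delta} |\langle Vu, (S_n - \Sigma)Vu \rangle| \right\|_{\psi_1} \\
&\le (1 - 2\delta)^{-1} C a \left( \sqrt{\frac{\log(1 + |\mathcal{N}_\delta|)}{n}} + \frac{\log(1 + |\mathcal{N}_\delta|)}{n} \right) ,
\end{align*}

where $C > 0$ is a constant. Choosing $\delta = 1/3$ and applying Proposition B.2
yields $|\mathcal{N}_\delta| \le 7^d$ and

\[ \log(1 + |\mathcal{N}_\delta|) \le d\log 8 . \]

Thus,

\[ \big\| \|\Pi(S_n - \Sigma)\Pi\|_2 \big\|_{\psi_1} \le 7Ca\left( \sqrt{d/n} + d/n \right) . \qquad □ \]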
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